Re: asynchronous execution
[ Adjusting subject line to reflect the actual topic of discussion better. ]
On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and
so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a that is on a different
subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
callbacks array, and so its parent (Append) is executed.
5. But, this Append planstate is already in the middle of executing Hash
Join, and is waiting for HashJoin.

Ah, yeah, something like that could happen. I've spent much of this
week working on a new design for this feature which I think will avoid
this problem. It doesn't work yet - in fact I can't even really test
it yet. But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.
Well, I promised to post this, so here it is. It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....
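To make the shape of the new interface concrete, here's roughly what the
requestor side looks like, condensed from the nodeAppend.c changes in the
attached patch (error handling and the as_nasyncpending bookkeeping are
elided; see the patch for the real thing):

#include "postgres.h"
#include "executor/execAsync.h"     /* added by the attached patch */
#include "executor/nodeAppend.h"

/*
 * Condensed requestor-side sketch: ask every child that needs a request,
 * then run the event loop until a result arrives or the timeout expires.
 * Results are delivered through the ExecAsyncResponse callback, which
 * buffers them in as_asyncresult (see nodeAppend.c in the patch).
 */
static TupleTableSlot *
append_fetch_async(AppendState *node)
{
    EState *estate = node->ps.state;
    int     i;

    /* Issue one request per child that currently needs one. */
    while ((i = bms_first_member(node->as_needrequest)) >= 0)
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);

    /*
     * Block indefinitely (-1) only if no synchronous children remain to
     * fall back on; otherwise just poll (0).
     */
    if (ExecAsyncEventLoop(estate, &node->ps, node->as_syncdone ? -1 : 0) &&
        node->as_nasyncresult > 0)
        return node->as_asyncresult[--node->as_nasyncresult];

    return NULL;    /* timed out; caller falls back to sync children */
}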
Some notes:
- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets (a rough
sketch of that idea follows these notes).
- There are probably other bugs, too.
Whee!
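For the WaitEventSet leak, a very rough sketch of the ResourceOwner idea
might look like the below. Note that the ResourceOwner*WaitEventSet
functions are invented names, modeled on the existing per-resource-kind
functions in resowner.c; nothing like them exists yet:

#include "postgres.h"
#include "storage/latch.h"
#include "utils/resowner.h"

/*
 * Hypothetical: create a WaitEventSet whose release is guaranteed by a
 * ResourceOwner, so that an error thrown while the executor is waiting
 * doesn't leak it.  ResourceOwnerEnlargeWaitEventSets() and
 * ResourceOwnerRememberWaitEventSet() are invented names; resowner.c
 * would have to grow them, mirroring what it already does for buffers
 * and files, and FreeWaitEventSet() would call a matching
 * ResourceOwnerForgetWaitEventSet().
 */
static WaitEventSet *
CreateWaitEventSetOwned(ResourceOwner owner, MemoryContext cxt, int nevents)
{
    WaitEventSet *set;

    ResourceOwnerEnlargeWaitEventSets(owner);       /* hypothetical */
    set = CreateWaitEventSet(cxt, nevents);
    ResourceOwnerRememberWaitEventSet(owner, set);  /* hypothetical */

    return set;
}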
Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different from the function that is used to
return a tuple to a synchronous caller.
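Concretely, that second point means results reach the requestor through a
dedicated callback rather than through ExecProcNode; condensed from
ExecAsyncAppendResponse in the attached patch:

/*
 * Condensed from ExecAsyncAppendResponse in the attached patch: the event
 * loop hands a completed result to the requestor here, while the
 * synchronous entry point (ExecAppend) merely returns whatever has been
 * buffered.  The two paths never share a stack frame.
 */
void
ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
{
    AppendState    *node = (AppendState *) areq->requestor;
    TupleTableSlot *slot = (TupleTableSlot *) areq->result;

    --node->as_nasyncpending;       /* request no longer outstanding */

    if (TupIsNull(slot))
        return;                     /* this child is exhausted */

    /* Buffer the tuple; mark the child as ready for another request. */
    node->as_asyncresult[node->as_nasyncresult++] = slot;
    node->as_needrequest = bms_add_member(node->as_needrequest,
                                          areq->request_index);
}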
Comments welcome, if you're feeling brave enough to look at anything
this half-baked.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: async-wip-2016-09-23.patch
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..ab69aa3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -343,6 +344,14 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
JoinPathExtraData *extra);
static bool postgresRecheckForeignScan(ForeignScanState *node,
TupleTableSlot *slot);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -455,6 +464,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for join push-down */
routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4342,6 +4357,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; on the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async plans need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately; ExecAppend will do
+ * that once the already-buffered results have been consumed.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..e7e55c0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4934,7 +4944,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4944,6 +4954,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6218,3 +6229,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fa3661..e5282b5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
HeapTuple *es_epqTuple; /* array of EPQ substitute tuples */
bool *es_epqTupleSet; /* true if EPQ tuple is provided */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1141,17 +1185,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.5.4 (Apple Git-61)
On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....
I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not so
significant as to cause the degradation. It will not call
ExecAsyncWaitForNode() unless that node supports asynchronism. Do you
feel there is anywhere else in the implementation that is really
causing this degradation? That previous implementation has some issues,
but they seemed solvable. We could resolve the plan state recursion
issue by explicitly making sure the same plan state does not get
called again while it is already executing.
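For illustration, a minimal sketch of such a guard might look like this;
the ps_busy flag is an invented PlanState field that exists in neither
version of the patch, so treat this as a thought experiment only:

#include "postgres.h"
#include "executor/executor.h"

/*
 * Thought experiment only: refuse to dispatch a plan state that is
 * already executing somewhere up the call stack.  ps_busy is an invented
 * PlanState field; neither version of the patch has anything like it.
 */
static TupleTableSlot *
ExecDispatchNodeGuarded(PlanState *node)
{
    TupleTableSlot *slot;

    if (node->ps_busy)              /* hypothetical flag */
        return NULL;                /* already active above us; skip */

    node->ps_busy = true;
    slot = ExecProcNode(node);
    node->ps_busy = false;

    return slot;
}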
Thanks
-Amit Khandekar
Sorry for the delayed response; I'll have enough time from now on,
and will address this.
At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>
Well, I promised to post this, so here it is.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, thank you for the comment.
At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9fRmEhUoBMnNN8K_QrHZf7m4rmOHTFDj492oeLZff8o=w@mail.gmail.com>
I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not so
significant as to cause the degradation.
The basic thought is that we don't accept a degradation of even as
little as around one percent for simple cases in exchange for this
feature (or similar ones).
A very simple SeqScan runs through a very short code path, where the
CPU's branch-misprediction penalty from even a few extra branches has
a visible impact. I avoided that by using likely/unlikely, but a more
fundamental measure is preferable.
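For illustration, the kind of hinting I mean, using the likely()/unlikely()
macros from c.h; async_capable here is only a stand-in for the actual test
the earlier patch performed:

#include "postgres.h"
#include "executor/executor.h"

/*
 * Sketch of keeping the async test off the hot path with a branch hint.
 * async_capable stands in for whatever test the earlier patch did in
 * ExecProcNode, and ExecAsyncWaitForNode() is that patch's entry point.
 */
static TupleTableSlot *
dispatch_child(PlanState *node)
{
    if (unlikely(node->async_capable))      /* stand-in for the real test */
        return ExecAsyncWaitForNode(node);

    return ExecProcNode(node);              /* common, predicted path */
}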
It will not call ExecAsyncWaitForNode() unless that node
supports asynchronism.
That's true, but it takes a certain number of CPU cycles to decide
whether to call it or not. That small bit of time is the issue in focus now.
Do you feel there is anywhere else in
the implementation that is really causing this degradation? That
previous implementation has some issues, but they seemed
solvable. We could resolve the plan state recursion issue by
explicitly making sure the same plan state does not get called
again while it is already executing.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the thought.
At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....
The previous framework didn't need to distinguish async-capable
from async-incapable nodes from the parent node's point of view;
that is why the ExecProcNode changes were required. Instead, this
new one removes the ExecProcNode changes by making async-aware
parents, that is, Append, distinguish the two kinds of child node.
This no longer involves async-unaware nodes in the tuple
bubbling-up mechanism, so the reentrancy problem doesn't seem to occur.
On the other hand, consider, for example, the following plan,
regardless of its practicality (there should be a better example...):

(Async-unaware node)
- NestLoop
  - Append
    - n * ForeignScan
  - Append
    - n * ForeignScan
If the NestLoop and Appends are async-aware, all of the ForeignScans
can run asynchronously under the previous framework: the topmost
NestLoop is awakened when the firing of any ForeignScan makes a tuple
bubble up to it. This is because of the
no-need-to-distinguish-aware-or-not nature provided by the
ExecProcNode changes.
On the other hand, with the new one, in order to do the same
thing, ExecAppend would in turn have to behave differently depending
on whether its parent is async-aware or not. Doing that looks
bothersome, and I am not confident about it.
I will examine this further intensively, especially for performance
degradation and for obstacles to completing this.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not so
significant as to cause the degradation.
I think we need some testing to prove that one way or the other. If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great. I don't think we can just rely on "it doesn't seem like it
should be slower", though - ExecProcNode() is too important a function
for us to guess at what the performance will be.
The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes. For example, if you imagine a plan like this:
Append
-> Hash Join
   -> Foreign Scan
   -> Hash
      -> Seq Scan
-> Hash Join
   -> Foreign Scan
   -> Hash
      -> Seq Scan
In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join. That's true in
either approach. Things are slightly better with the original
approach, but the basic problem is there in both cases. So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4 October 2016 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....
I see that the reason why you re-designed the asynchronous execution
implementation is because the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not significant
enough to cause the degradation.
I think we need some testing to prove that one way or the other. If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great. I don't think we can just rely on "it doesn't seem like it
should be slower", though - ExecProcNode() is too important a function
for us to guess at what the performance will be.
Agreed. I will come up with some tests.
Also, parent pointers are not required in the new design. Thinking of
parent pointers, now it seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over
to the other asynchronous tree when some node in the current subtree
is waiting. But I am not sure; I will think more on that.
The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes. For example, if you imagine a plan like this:
Append
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan
In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join. That's true in
either approach. Things are slightly better with the original
approach, but the basic problem is there in both cases. So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.
Yes, we might have to deal with this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Also, parent pointers are not required in the new design. Thinking of
parent pointers, now it seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over
to the other asynchronous tree when some node in the current subtree
is waiting. But I am not sure; I will think more on that.
The bubbling-up still happens, because each node that made an async
request gets a callback with the final response - and if it is itself
the recipient of an async request, it can use that callback to respond
to that request in turn. This version doesn't bubble up through
non-async-aware nodes, but that might be a good thing.
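To make that concrete, here is a hypothetical sketch - not code from
the patch - of how an intermediate async-aware node could chain the
callbacks. "IntermediateState" and its "parent_areq" field are made-up
names for a node that both requests tuples from a child and serves an
async request from its own parent:

	static void
	ExecAsyncIntermediateResponse(EState *estate, PendingAsyncRequest *areq)
	{
		IntermediateState *node = (IntermediateState *) areq->requestor;
		TupleTableSlot *slot = (TupleTableSlot *) areq->result;

		/* ... node-specific processing of the child's tuple goes here ... */

		/*
		 * If this node is itself the requestee of a pending async request,
		 * complete that request through the same mechanism, so the result
		 * bubbles up one level per callback.
		 */
		if (node->parent_areq != NULL)
			ExecAsyncRequestDone(estate, node->parent_areq, (Node *) slot);
	}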
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello, this works, but ExecAppend suffers a bit of degradation.
At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161003.194632.204401048.horiguchi.kyotaro@lab.ntt.co.jp>
Some notes:
- EvalPlanQual rechecks are broken.
This is fixed by adding (restoring) async-cancelation.
- EXPLAIN ANALYZE instrumentation is broken.
EXPLAIN ANALYZE seems to be working, but async-specific information
is not available yet.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
Fixed in the same way as EvalPlanQual.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
Applied my previous patch with some modifications.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
All actions other than scans now do vacate_connection() before using
a connection.
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
The WaitEventSet itself is not leaked, but the epoll fd should be
closed on failure. This seems doable by TRY-CATCHing in
ExecAsyncEventLoop (not done yet; a sketch follows).
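A minimal sketch of what I mean, assuming the WaitEventSet lives in
the EState as in the posted patch; the TRY/CATCH placement here is
illustrative, not patch code:

	PG_TRY();
	{
		noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
									 occurred_event, EVENT_BUFFER_SIZE);
	}
	PG_CATCH();
	{
		/* Close the epoll fd before propagating the error. */
		FreeWaitEventSet(estate->es_wait_event_set);
		estate->es_wait_event_set = NULL;
		PG_RE_THROW();
	}
	PG_END_TRY();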
- There are probably other bugs, too.
Whee!
Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.
Comments welcome, if you're feeling brave enough to look at anything
this half-baked.
This doesn't cause re-entry, since tuples no longer bubble up through
async-unaware nodes. The framework passes tuples through private
channels between requestors and requestees.
Anyway, I amended this and made postgres_fdw async, and finally all
regression tests pass with minor modifications. The attached patches
are the following:
0001-robert-s-2nd-framework.patch
The patch Robert posted upthread.
0002-Fix-some-bugs.patch
A small patch to fix compilation errors in 0001.
0003-Modify-async-execution-infrastructure.patch
Several modifications to the infrastructure. The details are given
after the measurements below.
0004-Make-postgres_fdw-async-capable.patch
A truly-async postgres_fdw.
gentblr.sql, testrun.sh, calc.pl
Performance test script suite:
gentblr.sql - creates the test tables,
testrun.sh - performs a single test run, and
calc.pl - drives testrun.sh and summarizes its results.
I measured performance and got the following results.
t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
Each result is written as "time in ms (stddev in ms)".
sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)
async
t0: 3806.31 ( 4.49) 0.4% faster (likely measurement error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
t0 is not affected, since the ExecProcNode changes are gone.
pl gets a bit slower (almost the same as the simple seqscan case with
the previous patch); this is presumably a branch-misprediction penalty.
pf0 and pf1 are faster, as expected.
========
Below is a summary of the modifications made by the 0002 and 0003
patches.
execAsync.c, execnodes.h:
- Added include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT.
- Changed the interface of ExecAsyncRequest to return whether a tuple
  is immediately available or not.
- Made ExecAsyncConfigureWait return whether it registered at least
  one wait event or not. This lets the caller (ExecAsyncEventWait)
  know that it has at least one event to wait on (for safety).
  If two or more postgres_fdw nodes share one connection, only one
  of them can be waited on at a time. It is the FDW driver's
  responsibility to ensure that at least one wait event is added;
  if none is, WaitEventSetWait silently waits forever.
- The separate areq->callback_pending and areq->request_complete
  flags always changed together, so they are replaced with a single
  state variable, areq->state, using a new enum AsyncRequestState in
  execnodes.h (sketched after this list).
nodeAppend.c:
- Return a tuple immediately if ExecAsyncRequest says that one is
  available.
- Reduced the nesting level of the for(;;) loop.
nodeForeignscan.[ch], fdwapi.h, execProcnode.c:
- Calling postgresIterateForeignScan directly can yield tuples of the
  wrong shape; call ExecForeignScan instead.
- Changed the interface of AsyncConfigureWait to match execAsync.c.
- Added a ShutdownForeignScan interface.
createplan.c, ruleutils.c, plannodes.h:
- With Robert's change, EXPLAIN shows somewhat odd plans where the
  Output of an Append is named after a non-parent child. This does no
  harm but is unsettling. Added the index of the parent child in
  Append.referent to make the output reasonable (but this looks
  ugly..). The children in EXPLAIN still appear in a different order
  from the definition. (expected/postgres_fdw.out is edited.)
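For reference, here is a sketch of the interface change to execAsync.c
described above. 0003 is not quoted inline, so these declarations are
reconstructed from the description and may differ from the patch in
detail:

	typedef enum AsyncRequestState
	{
		ASYNCREQ_IDLE,				/* no request outstanding */
		ASYNCREQ_WAITING,			/* request issued, result not ready */
		ASYNCREQ_CALLBACK_PENDING,	/* result ready, callback not run */
		ASYNCREQ_COMPLETE			/* result delivered to requestor */
	} AsyncRequestState;

	/*
	 * Now returns true if a tuple was immediately available, so that
	 * ExecAppend can return it without a trip through the event loop.
	 */
	extern bool ExecAsyncRequest(EState *estate, PlanState *requestor,
					 int request_index, PlanState *requestee);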
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 1af1d3ca952e6a241852d7b9b27be50915c8b0cc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/4] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..ab69aa3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -343,6 +344,14 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
JoinPathExtraData *extra);
static bool postgresRecheckForeignScan(ForeignScanState *node,
TupleTableSlot *slot);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -455,6 +464,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for join push-down */
routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4342,6 +4357,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Create a tuple from the specified row of the PGresult.
*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ /* Some events were found; recompute the remaining timeout. */
+ instr_time cur_time;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all non-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately; ExecAppend will do so
+ * the next time it needs a tuple. Note that bms_add_member() may
+ * repalloc its input, so the result must be stored back.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..e7e55c0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4934,7 +4944,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4944,6 +4954,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6218,3 +6229,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
HeapTuple *es_epqTuple; /* array of EPQ substitute tuples */
bool *es_epqTupleSet; /* true if EPQ tuple is provided */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_aync is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 2879fc2643e0916431def8a281ac9eb3c58794ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/4] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d97e694..6677bc4 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5082,12 +5082,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5110,12 +5110,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5138,12 +5138,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5166,12 +5166,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5230,120 +5230,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -5371,26 +5371,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -5398,19 +5398,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -5577,8 +5577,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ab69aa3..6da5843 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4374,7 +4375,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
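
The callback above produces its tuple synchronously and reports completion in the same call. A minimal standalone model of that request/complete handshake (DemoRequest and demo_request are illustrative names, not the patch's API):

#include <stdbool.h>
#include <stdio.h>

typedef struct DemoRequest
{
	bool	complete;	/* request finished, result is valid */
	void   *result;		/* NULL would mean "no more tuples" */
} DemoRequest;

/*
 * Produce the next value synchronously and mark the request done,
 * mirroring how the demo callback calls ExecForeignScan and then
 * ExecAsyncRequestDone in one step.
 */
static void
demo_request(DemoRequest *req, void *next_value)
{
	req->result = next_value;
	req->complete = true;
}

int
main(void)
{
	DemoRequest req = {false, NULL};
	int			tuple = 42;

	demo_request(&req, &tuple);
	printf("complete=%d, have result=%s\n",
		   req.complete, req.result != NULL ? "yes" : "no");
	return 0;
}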
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5112d6d..558bb8f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1c9bf13..40c6d08 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
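
For anyone who wants to poke at the waiting step in isolation, here is a self-contained sketch that uses poll(2) on a pipe in place of a WaitEventSet over libpq sockets. It models only the control flow -- block until a descriptor is readable, then flag the corresponding request for a callback -- and none of it is the real executor API:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fds[2];
	char		buf;
	struct pollfd pfd;

	if (pipe(fds) != 0)
		return 1;

	/* Simulate the remote server becoming ready to send data. */
	(void) write(fds[1], "x", 1);

	pfd.fd = fds[0];
	pfd.events = POLLIN;

	/* Wait for at least one event, as ExecAsyncEventWait does. */
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
	{
		(void) read(fds[0], &buf, 1);
		printf("descriptor ready: flag the request for a callback\n");
	}
	return 0;
}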
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From b21c0792ae9efb5e0c3db787b6be118ea5ff9938 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/4] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6677bc4..d429790 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5230,13 +5230,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -5244,10 +5244,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5267,13 +5267,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -5281,10 +5281,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5305,22 +5305,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5328,16 +5328,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5371,8 +5371,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -5389,8 +5389,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -5398,8 +5398,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 6da5843..997bd6c 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -348,7 +348,7 @@ static bool postgresRecheckForeignScan(ForeignScanState *node,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4379,11 +4379,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+	 * Give the requestee a chance to do whatever it wants.  A request
+	 * function signals an immediately available result by marking the
+	 * request ASYNC_COMPLETE.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Skip it if not pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+			 * Account for the callback before dispatching it, in case the
+			 * callback marks the request as pending again.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there is
+ * no event to wait for; the latter can occur when a request is completed
+ * while the wait is being set up.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event per requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
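
The two booleans (callback_pending, request_complete) are folded into a single state field by this patch. My reading of the lifecycle, as a standalone model: the enum values mirror AsyncRequestState, while the transition points are my annotation rather than anything the patch states.

#include <assert.h>
#include <stdio.h>

typedef enum
{
	ASYNC_IDLE,
	ASYNC_WAITING,
	ASYNC_CALLBACK_PENDING,
	ASYNC_COMPLETE
} AsyncRequestState;

static const char *state_names[] = {
	"ASYNC_IDLE", "ASYNC_WAITING", "ASYNC_CALLBACK_PENDING", "ASYNC_COMPLETE"
};

int
main(void)
{
	AsyncRequestState s = ASYNC_IDLE;

	/* ExecAsyncSetRequiredEvents: the node registered events to wait on. */
	assert(s == ASYNC_IDLE);
	s = ASYNC_WAITING;

	/* ExecAsyncEventWait: one of its descriptors became readable. */
	assert(s == ASYNC_WAITING);
	s = ASYNC_CALLBACK_PENDING;

	/* ExecAsyncRequestDone: the notify callback produced a result. */
	assert(s == ASYNC_CALLBACK_PENDING);
	s = ASYNC_COMPLETE;

	printf("final state: %s\n", state_names[s]);
	return 0;
}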
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+			/* Timeout reached.  Fall through to the sync nodes, if any exist. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
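
Net effect on ExecAppend's consumption order, modeled standalone (the arrays stand in for subplans; nothing here is the executor's real API): drain whatever async results have already arrived, then keep scanning the synchronous children.

#include <stdio.h>

#define NASYNC 2
#define NSYNC  2

int
main(void)
{
	int		async_results[NASYNC] = {10, 20};	/* tuples that already arrived */
	int		sync_results[NSYNC] = {30, 40};
	int		nasyncresult = NASYNC;
	int		i;

	/* Return async results first, while any are ready... */
	while (nasyncresult > 0)
		printf("async tuple: %d\n", async_results[--nasyncresult]);

	/* ...then continue with the synchronous children. */
	for (i = 0; i < NSYNC; i++)
		printf("sync tuple: %d\n", sync_results[i]);

	return 0;
}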
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e7e55c0..c73bbb3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate).  Since async
+	 * subplans are moved to the head of the subplan list, record where the
+	 * first child of best_path->subpaths ends up, as the referent.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4944,7 +4959,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4955,6 +4970,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
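
To check the referent arithmetic, here is a standalone model of the bookkeeping above (the child array and the is_async() test are made up for illustration): children are partitioned async-first, and the recorded index is wherever the originally-first child lands in the reordered list.

#include <stdbool.h>
#include <stdio.h>

static bool
is_async(int child)
{
	return child % 2 == 0;		/* pretend even-numbered children are async */
}

int
main(void)
{
	int			children[] = {1, 2, 3, 4};	/* original subpath order */
	int			nchildren = 4;
	int			nasync = 0;
	bool		first = true;
	bool		referent_is_sync = true;
	int			i;

	for (i = 0; i < nchildren; i++)
	{
		if (is_async(children[i]))
		{
			nasync++;
			if (first)
				referent_is_sync = false;
		}
		first = false;
	}

	/*
	 * Async plans go to the head of the list; the first sync plan
	 * therefore lands at index nasync.
	 */
	printf("referent index = %d\n", referent_is_sync ? nasync : 0);
	return 0;
}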
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 1337546fee26e5a80372c090acebc8bc53de3508 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/4] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating (and
+ * zeroing) initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
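
GetConnectionSpecificStorage's lazy allocation, modeled standalone: a single static struct stands in for the connection hash lookup, and malloc replaces MemoryContextAlloc on CacheMemoryContext.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoConnEntry
{
	void	   *storage;		/* connection-specific storage */
} DemoConnEntry;

static DemoConnEntry entry;		/* stand-in for the per-user-mapping entry */

static void *
demo_get_storage(size_t initsize)
{
	if (entry.storage == NULL)
	{
		entry.storage = malloc(initsize);
		memset(entry.storage, 0, initsize);
	}
	return entry.storage;
}

int
main(void)
{
	void	   *a = demo_get_storage(64);
	void	   *b = demo_get_storage(64);

	/* Repeated calls return the same zeroed block. */
	printf("same storage: %s\n", a == b ? "yes" : "no");
	free(a);
	return 0;
}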
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d429790..a53fff4 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5082,12 +5082,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5110,12 +5110,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5138,12 +5138,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5166,12 +5166,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -5259,9 +5259,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -5296,9 +5296,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -5561,27 +5561,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 997bd6c..c2b5b17 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -34,6 +34,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -52,6 +53,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -121,10 +125,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -135,7 +156,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -151,6 +172,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -164,11 +192,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -191,6 +219,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -289,6 +318,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -349,8 +379,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -373,7 +403,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -434,6 +467,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1314,12 +1348,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1375,32 +1418,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+			 * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+			 * Someone else is holding this connection.  Add myself to the
+			 * tail of the waiters' list and return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner
+			 * maintains a shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1416,7 +1553,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1424,6 +1561,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+	/* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1452,9 +1592,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1472,7 +1612,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1480,16 +1620,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+	/* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1691,7 +1847,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1770,6 +1928,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1780,14 +1940,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1795,10 +1955,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1836,6 +1996,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1856,14 +2018,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1871,10 +2033,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1912,6 +2074,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1932,14 +2096,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1947,10 +2111,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1997,16 +2161,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2286,7 +2450,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2339,7 +2505,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2386,8 +2555,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2505,6 +2674,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2547,6 +2717,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2826,11 +3006,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2896,47 +3076,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the rows from the FETCH already sent on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -2946,27 +3175,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * let the current connection owner read the result for the running query
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3050,7 +3334,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3060,12 +3344,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3073,9 +3357,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3206,9 +3490,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3216,10 +3500,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4365,8 +4649,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4375,22 +4661,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * another node on this connection is the owner.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4438,7 +4761,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..9eff0ba 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -100,6 +101,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f68e89..de1d96e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1248,8 +1248,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
This is the previous patch set rebased onto current master (through 0004),
with the resowner stuff added as 0005 and unlikely() as 0006.
At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp>
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.

The WaitEventSet itself is not leaked, but the epoll fd should be closed
on failure. This seemed doable by TRY-CATCHing in ExecAsyncEventLoop.
(not yet)
Haha, that was a silly idea on my part. The wait event set continues to
live after a timeout returns control out of the loop, and an error can
happen anywhere after that. So I added a resource-owner entry for wait
event sets, and the ones created in ExecAsyncEventWait are now hung on
TopTransactionResourceOwner. WaitLatchOrSocket deliberately doesn't do
this, so as not to change current behavior. WaitEventSet has no
identifier usable by resowner.c, so for now I use its address (pointer
value) for that purpose. Patch 0005 does this.
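Roughly, the registration looks like the sketch below. Note that
ResourceOwnerEnlargeWaitEventSets and ResourceOwnerRememberWaitEventSet
are illustrative names for the hooks 0005 adds to resowner.c, not
pre-existing APIs; CreateWaitEventSet and TopTransactionResourceOwner
already exist.

#include "postgres.h"
#include "storage/latch.h"
#include "utils/resowner.h"

/* Illustrative prototypes for the hooks 0005 adds to resowner.c. */
extern void ResourceOwnerEnlargeWaitEventSets(ResourceOwner owner);
extern void ResourceOwnerRememberWaitEventSet(ResourceOwner owner,
                                              WaitEventSet *set);

static WaitEventSet *
CreateTransactionWaitEventSet(MemoryContext cxt, int nevents)
{
    WaitEventSet *set;

    /* Reserve the slot first, so the Remember call below cannot fail. */
    ResourceOwnerEnlargeWaitEventSets(TopTransactionResourceOwner);

    set = CreateWaitEventSet(cxt, nevents);

    /*
     * WaitEventSet has no handle usable by resowner.c, so its pointer
     * value serves as the identifier to remember (and later forget,
     * when the set is freed or the transaction ends).
     */
    ResourceOwnerRememberWaitEventSet(TopTransactionResourceOwner, set);

    return set;
}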
I measured performance and got the following results.

t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on a single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicated connections>;

Results are written as "time<ms> (stddev <ms>)".
sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
t0: 3806.31 ( 4.49) 0.4% faster (should be measurement error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster

t0 is not affected, since the ExecProcNode changes are gone. pl gets a
bit slower (almost the same as the simple seqscan case with the
previous patch); this should be a branch-misprediction penalty. After
adding the likely() macro to ExecAppend (patch 0006), the degradation
seems to have been shaken off:
sync
t0: 3919.49 ( 5.95)
pl: 1637.95 ( 0.75)
pf0: 8304.20 ( 43.94)
pf1: 8222.09 ( 28.20)
async
t0: 3885.84 ( 40.20) 0.86% faster (should be measurement error, but stable in my environment)
pl: 1617.20 ( 3.51) 1.26% faster (ditto)
pf0: 6680.95 (478.72) 19.5% faster
pf1: 1886.87 ( 36.25) 77.1% faster
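
(For reference: the branch hints are just __builtin_expect wrappers.
Below is a self-contained sketch of the pattern, with the loop standing
in for ExecAppend's hot synchronous path; the actual hunk in 0006 may
differ.)

#include <stdio.h>

#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

static long
append_like_loop(long ntuples, int nasyncplans)
{
    long    fetched = 0;
    long    i;

    for (i = 0; i < ntuples; i++)
    {
        /* Async handling is the rare case; keep the sync path hot. */
        if (unlikely(nasyncplans > 0))
            fetched += 2;   /* stand-in for async bookkeeping */
        else
            fetched += 1;   /* stand-in for a synchronous fetch */
    }
    return fetched;
}

int
main(void)
{
    printf("%ld\n", append_like_loop(1000000L, 0));
    return 0;
}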
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 25eba7e506228ab087e8b743efb039286a8251c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/6] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c480945 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+-1 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
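+
+As an illustration, a minimal producer-side skeleton might look like the
+sketch below (the node type "Foo", its FooState, and foo_read_tuple() are
+hypothetical; the ExecAsync* entry points and es_wait_event_set are the
+ones introduced by this patch):
+
+    /* 1. Answer at once if possible, else declare one FD to wait on. */
+    static void
+    ExecAsyncFooRequest(EState *estate, PendingAsyncRequest *areq)
+    {
+        FooState   *foo = (FooState *) areq->requestee;
+
+        if (foo->tuple_ready)
+            ExecAsyncRequestDone(estate, areq, (Node *) foo->slot);
+        else
+            ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+    }
+
+    /* 2. Register the socket we wait on, unless it is already in the set. */
+    static void
+    ExecAsyncFooConfigureWait(EState *estate, PendingAsyncRequest *areq,
+                              bool reinit)
+    {
+        FooState   *foo = (FooState *) areq->requestee;
+
+        if (!reinit)
+            return;             /* still registered from last time */
+        AddWaitEventToSet(estate->es_wait_event_set,
+                          WL_SOCKET_READABLE, foo->sock, NULL, areq);
+    }
+
+    /* 3. The socket fired; produce the tuple and hand it back. */
+    static void
+    ExecAsyncFooNotify(EState *estate, PendingAsyncRequest *areq)
+    {
+        FooState   *foo = (FooState *) areq->requestee;
+
+        ExecAsyncRequestDone(estate, areq, (Node *) foo_read_tuple(foo));
+    }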
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* Initially, every async subplan needs a request. */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete
+ * synchronously; new requests are issued at the top of ExecAppend instead.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
HeapTuple *es_epqTuple; /* array of EPQ substitute tuples */
bool *es_epqTupleSet; /* true if EPQ tuple is provided */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
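To make the shape of the new FDW hooks concrete, here is a minimal sketch of a handler that fills them in. The my_fdw_* names are hypothetical; the request function answers synchronously, which is all the postgres_fdw demo does at this point:

#include "postgres.h"
#include "fmgr.h"
#include "executor/execAsync.h"
#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"

PG_FUNCTION_INFO_V1(my_fdw_handler);

/* Advertise every foreign path as async-capable. */
static bool
my_fdw_IsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;
}

/* Produce a tuple immediately and complete the request on the spot. */
static void
my_fdw_ForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	TupleTableSlot *slot = ExecForeignScan(node);

	ExecAsyncRequestDone(estate, areq, (Node *) slot);
}

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	/* ... the usual scan and modify callbacks go here ... */

	routine->IsForeignPathAsyncCapable = my_fdw_IsForeignPathAsyncCapable;
	routine->ForeignAsyncRequest = my_fdw_ForeignAsyncRequest;
	/* ForeignAsyncConfigureWait and ForeignAsyncNotify only become
	 * necessary once the request function stops answering synchronously. */

	PG_RETURN_POINTER(routine);
}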
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 4493e6d2d43a5864e9d381cb69270246e0c6234c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/6] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 88b696c..f9fd172 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6181,12 +6181,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6209,12 +6209,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6237,12 +6237,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6265,12 +6265,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6329,120 +6329,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6470,26 +6470,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6497,19 +6497,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6676,8 +6676,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c480945..e75f8a1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5112d6d..558bb8f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1c9bf13..40c6d08 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
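A note on the postgres_fdw hunk above: switching from postgresIterateForeignScan() to ExecForeignScan() matters because ExecForeignScan() goes through ExecScan(), which applies the plan's quals and projection before the slot reaches ExecAsyncRequestDone(); calling the iterate callback directly would hand back raw, unfiltered tuples. In sketch form, what the fixed request path now does:

static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	TupleTableSlot *slot;

	Assert(IsA(node, ForeignScanState));
	/* ExecForeignScan -> ExecScan applies quals and projection ... */
	slot = ExecForeignScan(node);
	/* ... so the requestor receives a fully processed tuple. */
	ExecAsyncRequestDone(estate, areq, (Node *) slot);
}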
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 126ed476a6d41e5cfb54be387123ac3a8e9963d0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/6] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index f9fd172..4b76e41 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6329,13 +6329,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6343,10 +6343,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6366,13 +6366,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6380,10 +6380,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6404,22 +6404,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6427,16 +6427,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6470,8 +6470,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6488,8 +6488,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6497,8 +6497,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e75f8a1..830212f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * Request functions mark the request ASYNC_COMPLETE when a result is immediately available.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Skip it if not pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was nothing to wait for, as when a request completes during setup.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but the node
+ * driver is responsible for registering at least one event per
+ * requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Fall through to the sync nodes, if any. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used
+ * when EXPLAIN deparses tlist entries (see set_deparse_planstate), so
+ * we record where the first child of best_path->subpaths lands in the
+ * reordered subplan list, via the referent field of the Append node.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
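Since ForeignAsyncConfigureWait now reports whether it actually registered anything, an FDW that runs real asynchronous queries might implement it roughly as below. This is only a sketch: MyFdwScanState and its conn and query_in_flight fields are illustrative, not part of the patch, and reinit handling is omitted. The key point is that areq rides along as user_data, which is how ExecAsyncEventWait finds the request to flip to ASYNC_CALLBACK_PENDING when the socket becomes readable:

#include "postgres.h"
#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "storage/latch.h"
#include "libpq-fe.h"

/* Hypothetical per-scan state; not part of the patch. */
typedef struct MyFdwScanState
{
	PGconn	   *conn;			/* libpq connection used by this scan */
	bool		query_in_flight;	/* query sent, result not yet read */
} MyFdwScanState;

static bool
my_fdw_ForeignAsyncConfigureWait(EState *estate,
								 PendingAsyncRequest *areq, bool reinit)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;

	/* Nothing to register if this node has no query outstanding. */
	if (!fsstate->query_in_flight)
		return false;

	/* Register the socket; areq is returned via user_data on wakeup. */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  PQsocket(fsstate->conn), NULL, areq);
	return true;
}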
0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 62d27e1420de596dbd6a3ecdae1dc1d0a51116cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/6] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating it
+ * with the given initsize if it does not yet exist.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4b76e41..ca69074 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6181,12 +6181,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6209,12 +6209,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6237,12 +6237,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6265,12 +6265,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6358,9 +6358,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6395,9 +6395,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6660,27 +6660,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 830212f..9244e51 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the shortcut
+ * to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no one
+ * is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Someone else is holding this connection. Add myself to the tail
+ * of the waiters' list, then return not-ready. To avoid scanning
+ * through the waiters' list, the current owner is to maintain the
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Fetch some more rows from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuple is remaining
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * let the current connection owner read the result for the running query
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index bb9d41a..d4b5fad 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 233e2e5125cdea90fa10fc05dd5ff1885f09cff2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/6] Use resource owner to prevent wait event set from leaking
Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those iterations.
This commit uses a resource owner to prevent such leaks.
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need resowner for event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..272e460 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 11749cc592ac8369fcc9fbfb362ddd2a6f2f0a90 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/6] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by misprediction of the branches
related to async execution. Apply unlikely() to those branches to avoid
such penalty on the synchronous route. Asynchronous execution already
adds a lot of code, so this doesn't cause significant degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
Hi, this is the 7th patch to make instrumentation work.
EXPLAIN ANALYZE shows the following result with the previous patch set.
| Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4324.676..4324.676 rows=1 loops=1)
|   -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=0.910..3663.882 rows=4000000 loops=1)
|     -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) (never executed)
|     -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
The current instrumentation code assumes that a node, once asked for a
tuple, always either returns one or reports end-of-tuples. This async
framework has two points at which underlying nodes are executed,
ExecAsyncRequest and ExecAsyncEventLoop. So I'm not sure whether this
treatment is appropriate, but it seems to show sane numbers anyway (a
toy model of the accounting rule follows the output below).
| Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4571.205..4571.206 rows=1 loops=1)
|   -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=1.362..3893.114 rows=4000000 loops=1)
|     -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) (actual time=1.056..770.863 rows=1000000 loops=1)
|     -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.461..767.840 rows=1000000 loops=1)
|     -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.474..782.547 rows=1000000 loops=1)
|     -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) (actual time=0.156..765.920 rows=1000000 loops=1)
|     -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) (never executed)
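As a toy illustration of the accounting rule above, the following is a
self-contained sketch in plain C. Every name in it (ToyInstr, toy_start,
toy_stop) is invented for illustration and is not PostgreSQL code; the
real hooks are the InstrStartNode/InstrStopNode calls in the attached
diff. The point it models: the clock is started and stopped at both
async entry points, but the first-tuple timer only starts once a stop
actually delivers a tuple.

/*
 * Toy, standalone model of the instrumentation accounting (invented
 * names, not the patch's API).  Build with: cc -std=c99 toy_instr.c
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct ToyInstr
{
    bool    running;    /* has this node delivered its first tuple? */
    double  tuples;     /* tuples delivered so far */
    double  total;      /* seconds accumulated inside the node */
    clock_t started;
} ToyInstr;

static void
toy_start(ToyInstr *in)
{
    in->started = clock();
}

static void
toy_stop(ToyInstr *in, double ntuples)
{
    in->total += (double) (clock() - in->started) / CLOCKS_PER_SEC;
    in->tuples += ntuples;

    /*
     * Mirrors the instrument.c tweak: a stop that delivered no tuple
     * (the request merely went asynchronous) must not start the
     * first-tuple clock.
     */
    if (!in->running && ntuples > 0)
        in->running = true;
}

int
main(void)
{
    ToyInstr    in = {0};

    toy_start(&in);     /* like ExecAsyncRequest: no tuple ready yet */
    toy_stop(&in, 0);

    toy_start(&in);     /* like ExecAsyncEventLoop: the tuple arrives */
    toy_stop(&in, 1);

    printf("running=%d tuples=%.0f total=%.6fs\n",
           in.running, in.tuples, in.total);
    return 0;
}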
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 35c60a46f49aab72d492c798ff7eb8fc0e672250 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution
Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
Hello,
I'm not sure this is in a suitable shape for the commitfest, but I
decided to register it to ride on the bus for 10.0.
This is a PoC patch of the asynchronous execution feature, based on the
executor infrastructure Robert proposed. These patches are rebased on
the current master.
0001-robert-s-2nd-framework.patch
Robert's executor async infrastructure. Async-driver nodes register
their async-capable children, and synchronization and data transfer are
done out of band of the ordinary ExecProcNode channel. So async
execution no longer disturbs async-unaware nodes or slows them down.
(A toy sketch of the requestor-side pattern follows this list.)
0002-Fix-some-bugs.patch
Some fixes needed for 0001 to work. This is kept separate just to
preserve the shape of the 0001 patch.
0003-Modify-async-execution-infrastructure.patch
The original infrastructure doesn't work when multiple foreign tables
are on the same connection. This makes it work.
0004-Make-postgres_fdw-async-capable.patch
Makes postgres_fdw work asynchronously.
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
This addresses a problem Robert pointed out with the 0001 patch: a
WaitEventSet used for async execution can leak on error.
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
ExecAppend gets a bit slower because of branch misprediction penalties.
This fixes that by using the unlikely() macro.
0007-Add-instrumentation-to-async-execution.patch
As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special
treatment to show sane results. This patch attempts that.
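For readers who want the control flow at a glance, here is a toy,
self-contained sketch (plain C, compiles standalone) of the
requestor-side pattern described in the README hunk of 0001: post a
request to each async-capable child, run the event loop, and treat each
completion as the ExecAsyncResponse callback. Every type and function
below is invented for illustration and is not the patches' actual API;
the real driver is the ExecAppend code touched by 0001 and 0006.

/*
 * Toy model of the ExecAsyncRequest / ExecAsyncEventLoop /
 * ExecAsyncResponse pattern (invented names, not the patch's API).
 * Build with: cc -std=c99 toy_async.c
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ToyChild
{
    int     remaining;  /* tuples this child can still produce */
    bool    pending;    /* is a request outstanding? */
} ToyChild;

/* like ExecAsyncRequest: post a request; the result may not be ready */
static void
toy_request(ToyChild *child)
{
    child->pending = true;
}

/*
 * like ExecAsyncEventLoop: poll the pending requests and fire one
 * completion (the "ExecAsyncResponse" callback); returns false when
 * nothing more will become ready.
 */
static bool
toy_event_loop(ToyChild *children, int n, int *result_from)
{
    for (int i = 0; i < n; i++)
    {
        if (children[i].pending && children[i].remaining > 0)
        {
            children[i].remaining--;
            children[i].pending = false;
            *result_from = i;
            return true;
        }
        children[i].pending = false;    /* exhausted child: request done */
    }
    return false;
}

int
main(void)
{
    ToyChild    kids[2] = {{2, false}, {1, false}};
    int         from;

    /* Append-style driver: fire a request at every async child up front */
    for (int i = 0; i < 2; i++)
        toy_request(&kids[i]);

    /* then consume results as the loop reports them, re-requesting */
    while (toy_event_loop(kids, 2, &from))
    {
        printf("tuple from child %d\n", from);
        toy_request(&kids[from]);
    }
    return 0;
}

The real loop also takes a timeout (-1 to block, 0 to poll), which
ExecAppend chooses based on whether the synchronous subplans are
already done.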
A result of a performance measurement is in this message.
/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 8519a24a85a0d033ae9b6ddcc175f5948bb90b76 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c480945 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not yet been delivered.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ /* Recompute the remaining timeout; don't shadow cur_timeout here. */
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount, whereas we want to
+ * push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch &&
+ !areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a node that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async subplan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one here immediately, because it might complete
+ * at once and deliver a second result before this one has been consumed.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
HeapTuple *es_epqTuple; /* array of EPQ substitute tuples */
bool *es_epqTupleSet; /* true if EPQ tuple is provided */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number of events any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
Attachment: 0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From c0d26333dd549343ab0658aace4389b1ea60eedb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2745ad5..1b36579 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c480945..e75f8a1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..ca91dd8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e8dac6..87ce505 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
Attachment: 0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 75eec490d5fa5e7272066ab35bba30c8c00e87cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1b36579..a98e138 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e75f8a1..830212f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * Request functions mark the request ASYNC_COMPLETE if a result is available.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Process the callback if one is pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for. The latter can occur when a pending request is
+ * completed while its wait events are being configured.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event per requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Fall through to the sync nodes, if any remain. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child of an inheritance set is the representative used when
+ * deparsing tlist entries (see set_deparse_planstate), so we must track
+ * where the first child of best_path->subpaths ends up in the reordered
+ * subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append *) (((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
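+
+/*
+ * Lifecycle, as wired up in execAsync.c: a request starts ASYNC_IDLE,
+ * becomes ASYNC_WAITING when ExecAsyncSetRequiredEvents registers its
+ * wait events, moves to ASYNC_CALLBACK_PENDING when one of those events
+ * fires, and ends ASYNC_COMPLETE once ExecAsyncRequestDone stores the
+ * result.
+ */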
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 04d33f89391ad8aedfa9b13a2dd72f87f19c3ae1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating it
+ * (zeroed, initsize bytes) if it does not exist yet.
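+ *
+ * The storage is keyed by the user mapping's umid, so every scan
+ * sharing the connection sees the same block.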
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a98e138..38dc55b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 830212f..9244e51 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnspecate;
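+
+/*
+ * Only one scan at a time may have a query in flight on a given
+ * connection; current_owner records which one. Other scans sharing the
+ * connection either queue behind the owner (see the waiter list in
+ * PgFdwScanState) or make it absorb its result first (vacate_connection).
+ */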
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
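+
+ /*
+ * The waiter links form a FIFO of scans waiting for this connection;
+ * keeping last_waiter in the current owner makes appending a new
+ * waiter O(1) instead of walking the whole list.
+ */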
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If other nodes are waiting on this connection behind us, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner maintains the shortcut to the last
+ * waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Someone else is holding this connection. Add myself to the tail
+ * of the waiters' list, then return not-ready. To avoid scanning
+ * through the waiters' list, the current owner maintains a
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+ /*
+ * If we haven't received a result for this node yet, return no
+ * tuple and give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Clean up asynchronous state and absorb any leftover result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by the query previously sent on the node's connection.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node owns the connection; otherwise
+ * another node on the same connection is the owner and will register it.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *) node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan(fsstate);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
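+/*
+ * Called from ExecShutdownNode (see the execProcnode.c hunk above); gives
+ * the FDW a chance to absorb an in-flight query and release other
+ * cross-node resources before the node shuts down.
+ */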
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 616e4186479fda5f7f5d87f2fd2e6b9d0fa9f603 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking
Wait event sets created for async execution can live across multiple
iterations, so they leak if an error occurs during those iterations.
This commit uses resource owners to prevent such leaks.
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need a resowner for its event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..272e460 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 4129f613956b2e87fb924533b28ea44a7f7e3dc3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to get slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing synchronous route. Asynchronous execution
already involves a lot of additional code, so this doesn't add
significant degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 13872b3aed2cf7627af8cd4d009712574c7c9ad5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution
Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
Hello, this is a maintenance post of rebased patches.
I added a change to ResourceOwnerData that was missing from 0005.
At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
This is a PoC patch of the asynchronous execution feature, based on an
executor infrastructure Robert proposed. These patches are rebased on
the current master.

0001-robert-s-2nd-framework.patch
Robert's executor async infrastructure. Async-driver nodes register
their async-capable children, and synchronization and data transfer
are done out of band of the ordinary ExecProcNode channel, so async
execution no longer disturbs async-unaware nodes or slows them down.

0002-Fix-some-bugs.patch
Some fixes needed to make 0001 work, kept separate to preserve the
shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch
The original infrastructure doesn't work when multiple foreign tables
are on the same connection. This makes it work.

0004-Make-postgres_fdw-async-capable.patch
Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
This addresses a problem Robert pointed out about the 0001 patch:
a WaitEventSet used for async execution can leak on error.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
ExecAppend gets a bit slower because of mispredicted branches. This
fixes it by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch
As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE needs special
treatment to show sane results. This patch attempts that.

A result of a performance measurement is in this message:
/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster
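For anyone who wants the shape of the new interface without wading
through the diffs, below is a condensed sketch of the requestor-side
flow that 0001 implements in ExecAppend. It is a restatement for
illustration only, not part of the patches; it compiles only against
the patched tree, and the helper name append_next_async_tuple is
hypothetical.

#include "postgres.h"

#include "executor/execAsync.h"		/* added by patch 0001 */
#include "executor/nodeAppend.h"

/*
 * Sketch of the requestor-side pattern from patch 0001's ExecAppend:
 * fire requests at async-capable children, run the event loop, and
 * return results stashed by the ExecAsyncAppendResponse callback.
 */
static TupleTableSlot *
append_next_async_tuple(AppendState *node)
{
	EState	   *estate = node->ps.state;
	int			i;

	/* Issue a request to every async child that needs one. */
	while ((i = bms_first_member(node->as_needrequest)) >= 0)
	{
		ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
		node->as_nasyncpending++;
	}

	/*
	 * Run the event loop until some child delivers a tuple; a timeout
	 * of -1 blocks, 0 only polls.  Completed tuples arrive through the
	 * node's ExecAsyncAppendResponse callback, which stores them in
	 * as_asyncresult[].
	 */
	while (node->as_nasyncresult == 0 && node->as_nasyncpending > 0)
	{
		if (!ExecAsyncEventLoop(estate, &node->ps, -1))
			break;
	}

	if (node->as_nasyncresult > 0)
		return node->as_asyncresult[--node->as_nasyncresult];

	return NULL;				/* all async children are exhausted */
}

On the producer side, the demo in 0001's postgres_fdw changes simply
calls postgresIterateForeignScan and hands the slot back through
ExecAsyncRequestDone, as shown in the attachments below.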
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From f1c33db03494975bdf3ef5a9856a5c99041f0e55 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fbe6929..ef4acc7 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ /* Compute the timeout remaining for the next wait. */
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete
+ * before the result we just saved has been consumed.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04e49b7..e4a103f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 748b687..1566e0d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
HeapTuple *es_epqTuple; /* array of EPQ substitute tuples */
bool *es_epqTupleSet; /* true if EPQ tuple is provided */
bool *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
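
/*
 * Sketch of intended use, matching the executor changes elsewhere in this
 * series: tuples from completed async requests are stacked in as_asyncresult
 * and ExecAppend returns them LIFO, e.g.
 *
 *	if (node->as_nasyncresult > 0)
 *		return node->as_asyncresult[--node->as_nasyncresult];
 */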
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
Attachment: 0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From f2aa3c04fc79163bb45e9e122b151b39110d3cd7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 785f520..457cfdb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ef4acc7..c64ae41 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..ca91dd8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e8dac6..87ce505 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
Attachment: 0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From a4d81f284e9eac5e60c2dfe7e9f693acae73ab36 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 457cfdb..083d947 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c64ae41..b92b279 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * Request functions mark the request ASYNC_COMPLETE if a result is
+ * immediately available.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Dispatch the callback if one is pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there's
+ * no event to wait. The latter is occur when the areq is processed during
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Fall through to the sync nodes, if any remain. */
+ if (!node->as_syncdone)
+ break;
+ }
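+
+ /*
+ * A note on the timeout chosen above: while synchronous children
+ * remain (as_syncdone is false) we only poll the async children
+ * (timeout 0) before falling back to the sync ones; once the sync
+ * children are exhausted we block (timeout -1) until an async child
+ * produces a tuple or all pending requests complete.
+ */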
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e4a103f..27ccf9d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1566e0d..c8b9f31 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used when
+ * deparsing tlist entries in EXPLAIN (see set_deparse_planstate). Since
+ * async children are moved to the front of the subplan list, we track
+ * whether that first child is sync or async so that make_append can
+ * record its final index as the Append node's referent.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index a3a4174..9a2ee83 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4079,7 +4079,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
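+
+/*
+ * Transitions, as implemented in execAsync.c: ASYNC_IDLE -> ASYNC_WAITING
+ * via ExecAsyncSetRequiredEvents(); ASYNC_WAITING -> ASYNC_CALLBACK_PENDING
+ * when an awaited event fires; ASYNC_IDLE or ASYNC_CALLBACK_PENDING ->
+ * ASYNC_COMPLETE via ExecAsyncRequestDone(), at which point result is valid.
+ */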
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
Attachment: 0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From f708cece53a6dd21647478b0fba6a8b4dff992a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. Allocate it with
+ * initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
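+
+/*
+ * For example, the scan-state initialization later in this patch obtains
+ * its per-connection area with:
+ *
+ *	fsstate->s.connspec = (PgFdwConnspecate *)
+ *		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ */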
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 083d947..15519c1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b92b279..21e7fd9 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
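+
+/*
+ * PgFdwState is embedded as the first member ("s") of the scan, modify and
+ * direct-modify state structs below, so connection-level helpers such as
+ * vacate_connection() can accept any of them as a PgFdwState pointer.
+ */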
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the shortcut
+ * to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no one
+ * is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Someone else is holding this connection. Add myself to the tail
+ * of the waiters' list then return not-ready. To avoid scanning
+ * through the waiters' list, the current owner is to maintain the
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node in the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node in the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the rows already requested from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * let the current connection owner read the result for the running query
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 951d3a9ee1bc73c63428620c0c02b225451275c2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking
Wait event sets created for async execution can live across several
iterations, so they leak if an error is thrown during one of those
iterations. This commit uses a resource owner to prevent such leaks.
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need a resource owner for the event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..0b590c1 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
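For illustration, a caller that wants a WaitEventSet which survives across
executor iterations but cannot leak on error follows the same pattern 0005
applies in execAsync.c. A minimal sketch, with a made-up helper name
(create_tracked_wait_event_set is not part of the patches):

/*
 * Sketch only: charge a long-lived WaitEventSet to the transaction's
 * resource owner so that error cleanup releases it automatically.
 */
static WaitEventSet *
create_tracked_wait_event_set(MemoryContext cxt, int nevents)
{
	ResourceOwner	savedOwner = CurrentResourceOwner;
	WaitEventSet   *set;

	/* CreateWaitEventSet() remembers the set in CurrentResourceOwner. */
	CurrentResourceOwner = TopTransactionResourceOwner;
	PG_TRY();
	{
		set = CreateWaitEventSet(cxt, nevents);
	}
	PG_CATCH();
	{
		CurrentResourceOwner = savedOwner;
		PG_RE_THROW();
	}
	PG_END_TRY();
	CurrentResourceOwner = savedOwner;

	/*
	 * On transaction abort the resource owner calls FreeWaitEventSet();
	 * an explicit FreeWaitEventSet() forgets the set first, so normal
	 * cleanup emits no leak warning.
	 */
	return set;
}

Short-lived sets such as the one in WaitLatchOrSocket() opt out by clearing
CurrentResourceOwner around CreateWaitEventSet(), as the latch.c hunk above
does.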
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 4a49530e1b4d968ae067819bf872049ebfae48eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to get slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them so that
the existing synchronous route avoids the penalty. Asynchronous
execution already carries a lot of additional code, so this doesn't
add significant degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 9d6a9444aea28c2880ecbedcaa3d721150d4a988 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution
Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
Hello,
I cannot respond until next Monday, so I am moving this to the next CF
myself.
At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161115.202513.268072050.horiguchi.kyotaro@lab.ntt.co.jp>
Hello, this is a maintenance post of rebased patches.
I added a change to ResourceOwnerData that was missing in 0005.
At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
This is a PoC patch of the asynchronous execution feature, based on an
executor infrastructure Robert proposed. These patches are
rebased on the current master.

0001-robert-s-2nd-framework.patch
Robert's executor async infrastructure. Async-driver nodes
register their async-capable children, and synchronization and data
transfer are done out of band of the ordinary ExecProcNode channel, so
async execution no longer disturbs async-unaware nodes or slows them
down.

0002-Fix-some-bugs.patch
Some fixes to make 0001 work. This is kept separate just to preserve
the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch
The original infrastructure doesn't work when multiple foreign
tables are on the same connection. This makes it work.

0004-Make-postgres_fdw-async-capable.patch
Makes postgres_fdw work asynchronously.
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can leak on errors.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
ExecAppend gets a bit slower from branch-misprediction penalties.
This fixes it by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch
As described above for 0001, the async infrastructure conveys
tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE requires
special treatment to show sane results. This patch tries that.

A result of a performance measurement is in this message.
/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster
--
Kyotaro Horiguchi
NTT Open Source Software Center
This patch conflicts with e13029a (es_query_dsa), so I rebased it.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 68e8bbb5996f8a3605b440933d59bbd12268269a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fbe6929..ef4acc7 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
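
For illustration, the three producer-side callbacks described above might be
wired up as in the following minimal sketch. MyScanState, its sock and
tuple_ready fields, and fetch_ready_tuple() are hypothetical; the ExecAsync*
calls are the ones this patch series introduces, and note that
ExecAsyncRequest currently dispatches only ForeignScanState, so a new node
type would also need a case there:

/* Sketch only: a hypothetical async-capable producer node. */
static void
MyScanAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	MyScanState *node = (MyScanState *) areq->requestee;

	if (node->tuple_ready)
	{
		/* A result is available immediately; report it right away. */
		ExecAsyncRequestDone(estate, areq, (Node *) fetch_ready_tuple(node));
	}
	else
	{
		/* Wait for one fd event; flags as in the postgres_fdw caller. */
		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
	}
}

static bool
MyScanAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
						 bool reinit)
{
	MyScanState *node = (MyScanState *) areq->requestee;

	if (!reinit)
		return true;			/* our event is already in the set */

	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  node->sock, NULL, areq);
	return true;
}

static void
MyScanAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	MyScanState *node = (MyScanState *) areq->requestee;

	/* The socket became readable; produce a tuple and hand it back. */
	ExecAsyncRequestDone(estate, areq, (Node *) fetch_ready_tuple(node));
}

The consumer side then drives these callbacks through ExecAsyncRequest and
ExecAsyncEventLoop, as nodeAppend.c does.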
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else
+ {
+ instr_time cur_time;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set request_complete if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; instead, we want to
+ * push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
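[ Illustration, not part of the patch: a minimal sketch of the FDW side of
this interface, assuming a hypothetical "myfdw" wrapper driving one libpq
connection per scan. Only ExecAsyncSetRequiredEvents, AddWaitEventToSet,
and ExecAsyncRequestDone come from the patch; the MyFdwScanState struct
and the myfdw_fetch_one_slot helper are invented for the sketch. ]

typedef struct MyFdwScanState
{
	PGconn	   *conn;		/* open libpq connection */
	const char *query;		/* remote query text */
	bool		query_sent;	/* has the query been dispatched? */
} MyFdwScanState;

/* hypothetical helper: builds a slot from a fetched row, empty at EOF */
static TupleTableSlot *myfdw_fetch_one_slot(ForeignScanState *node);

static void
myfdwForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	MyFdwScanState *fss = (MyFdwScanState *) node->fdw_state;

	/* Kick off the remote query without blocking, once per scan. */
	if (!fss->query_sent)
	{
		if (!PQsendQuery(fss->conn, fss->query))
			elog(ERROR, "could not send remote query: %s",
				 PQerrorMessage(fss->conn));
		fss->query_sent = true;
	}

	/* We need one socket event; we don't care about the process latch. */
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static void
myfdwForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
							   bool reinit)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	MyFdwScanState *fss = (MyFdwScanState *) node->fdw_state;

	/* Register our socket; user_data points back at the request. */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  PQsocket(fss->conn), NULL, areq);
}

static void
myfdwForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	MyFdwScanState *fss = (MyFdwScanState *) node->fdw_state;

	if (!PQconsumeInput(fss->conn))
		elog(ERROR, "could not read from remote server: %s",
			 PQerrorMessage(fss->conn));
	if (PQisBusy(fss->conn))
		return;					/* not ready yet; we'll be re-waited */

	/* Hand back one tuple, or an empty slot to signal EOF. */
	ExecAsyncRequestDone(estate, areq,
						 (Node *) myfdw_fetch_one_slot(node));
}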
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async subplan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately, because the new one
+ * might complete before we have returned the result just saved.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d973225..a4b31cc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7258c03..c59c635 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d608530..8051c58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1521,6 +1521,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
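[ Illustration, not part of the patch: the FDW half of the check above can
be trivial; "myfdw" is the same hypothetical wrapper as before, here
treating every plain foreign-scan path as async-capable. ]

static bool
myfdwIsForeignPathAsyncCapable(ForeignPath *path)
{
	/* For this sketch, every foreign path is asynchronously drivable. */
	return true;
}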
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
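[ Illustration, not part of the patch: an FDW opts in by filling the four
new FdwRoutine fields from its handler function, here using the
hypothetical "myfdw" callbacks sketched earlier. ]

Datum
myfdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	/* ... the usual scan and modify callbacks ... */

	/* Support functions for asynchronous execution */
	routine->IsForeignPathAsyncCapable = myfdwIsForeignPathAsyncCapable;
	routine->ForeignAsyncRequest = myfdwForeignAsyncRequest;
	routine->ForeignAsyncConfigureWait = myfdwForeignAsyncConfigureWait;
	routine->ForeignAsyncNotify = myfdwForeignAsyncNotify;

	PG_RETURN_POINTER(routine);
}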
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5c3b868..7b0e145 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -430,6 +449,31 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_aync is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1165,17 +1209,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From f63728704995dd9b147a2f94778e1c1ad05da517 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 785f520..457cfdb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ef4acc7..c64ae41 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61e6a2c..beae80b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 282f8ae..a42ad48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From a951a6124c00297d825323199a30c8c570ca46b4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 457cfdb..083d947 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c64ae41..b92b279 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * A request function signals an immediately-available result by marking
+ * the request ASYNC_COMPLETE.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_complete if appropriate. */
- if (!areq->request_complete)
+ /* If a callback is pending, dispatch it. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Decrement the pending-callback count before dispatching the
+ * callback, in case the callback marks this request as needing
+ * another callback.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for; the latter can occur when a request completes
+ * before we actually wait.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event per requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Fall through to the sync nodes, if any exist. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a4b31cc..ad64649 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c59c635..829e826 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 8051c58..7f72c99 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1522,6 +1522,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used when
+ * explaining tlist entries (see set_deparse_planstate), so we keep the
+ * first child of best_path->subpaths at the head of the subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 4e2ba19..47135fe 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4241,7 +4241,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7b0e145..139bd8e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -357,6 +357,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -365,8 +372,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
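To make the new request lifecycle explicit: a PendingAsyncRequest starts in
ASYNC_IDLE, becomes ASYNC_WAITING once ExecAsyncSetRequiredEvents registers
its wait events, moves to ASYNC_CALLBACK_PENDING when an event fires, and
ends in ASYNC_COMPLETE when ExecAsyncRequestDone stores the result. Here is
a minimal sketch of the transitions as I read them out of execAsync.c; the
checker function itself is illustrative and not part of the patch:

/* Illustrative only: the state transitions implied by execAsync.c. */
static bool
async_transition_is_legal(AsyncRequestState from, AsyncRequestState to)
{
    switch (to)
    {
        case ASYNC_WAITING:
            /* ExecAsyncSetRequiredEvents registered wait events */
            return from == ASYNC_IDLE || from == ASYNC_CALLBACK_PENDING;
        case ASYNC_CALLBACK_PENDING:
            /* ExecAsyncEventWait saw a socket event or the process latch */
            return from == ASYNC_WAITING;
        case ASYNC_COMPLETE:
            /* ExecAsyncRequestDone stored the result */
            return from == ASYNC_IDLE || from == ASYNC_CALLBACK_PENDING;
        default:
            return false;
    }
}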
Attachment: 0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 20a519a37bb2667427d1c857466bd220d9fe0bf9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. Allocate it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 083d947..15519c1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b92b279..21e7fd9 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready; /* true if a result (tuple or EOF) is ready */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let
+ * the first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Another node is holding this connection. Add myself to the tail
+ * of the waiters' list and return not-ready. To avoid scanning
+ * through the waiters' list, the current owner maintains a
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node in the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node in the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Clean up the async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Read the rows returned by the FETCH previously sent on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while (fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is
+ * immediately available. ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner and registers the event.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+ /* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *) node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan(fsstate);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
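The heart of this patch is a single-owner discipline per connection: only
one scan node may have a FETCH in flight on a given connection, and any
other node that wants the connection queues itself behind the owner through
the waiter/last_waiter links. A condensed sketch of the hand-off that
postgresIterateForeignScan performs once the owner has drained its result
(the helper function is illustrative; the fields are the ones the patch adds
to PgFdwScanState):

/* Illustrative sketch of the connection hand-off in postgres_fdw.c. */
static ForeignScanState *
hand_off_connection(ForeignScanState *owner)
{
    PgFdwScanState *owner_state = GetPgFdwScanState(owner);
    ForeignScanState *next = owner_state->waiter;

    if (next != NULL)
    {
        PgFdwScanState *next_state = GetPgFdwScanState(next);

        owner_state->waiter = NULL;
        /* Only the current owner maintains the shortcut to the tail. */
        next_state->last_waiter = owner_state->last_waiter;
        /* A node that nobody waits for points last_waiter at itself. */
        owner_state->last_waiter = owner;
    }
    return next;                /* NULL means the connection is vacant */
}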
Attachment: 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 9ad5ab969809960a5d954aed086743e04a963e2e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking
Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those iterations.
This commit uses resource owners to prevent such leaks.
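The pattern follows the usual ResourceOwner discipline: reserve space in the
owner's array before acquiring the resource, so an out-of-memory failure
cannot lose track of it; remember the resource once it exists; and forget it
on explicit free. In outline (condensed and illustrative;
allocate_and_init_set stands in for the existing function body):

/* Condensed, illustrative view of the CreateWaitEventSet change below. */
WaitEventSet *
CreateWaitEventSet(MemoryContext context, int nevents)
{
    WaitEventSet *set;

    if (CurrentResourceOwner)
        ResourceOwnerEnlargeWESs(CurrentResourceOwner); /* may fail safely */

    set = allocate_and_init_set(context, nevents);      /* existing body */

    set->resowner = CurrentResourceOwner;
    if (set->resowner)
        ResourceOwnerRememberWES(set->resowner, set);   /* cannot fail */
    return set;
}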
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index b7e5129..90a93cc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need a resource owner for its event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index cdc460b..46c2531 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
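For illustration, the caller-visible effect of 0005 is sketched below.
The names all come from the patch and from core; the surrounding
fragment is hypothetical and error handling is trimmed. execAsync.c
switches to TopTransactionResourceOwner around its CreateWaitEventSet
call, as shown in the hunk above:

	WaitEventSet *set;
	ResourceOwner saved = CurrentResourceOwner;

	/* Create the set under the transaction's owner so it cannot leak. */
	CurrentResourceOwner = TopTransactionResourceOwner;
	set = CreateWaitEventSet(CurrentMemoryContext, 2);	/* RememberWES */
	CurrentResourceOwner = saved;

	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

	/*
	 * If an error is thrown anywhere from here on, resource owner cleanup
	 * now calls FreeWaitEventSet for us (emitting a leak WARNING only in
	 * the commit path), instead of leaking the set and its kernel state.
	 */
	FreeWaitEventSet(set);		/* normal path: ForgetWES unregisters it */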
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 5c39b606ada4ed4c84d4aea283ada6f19a90913a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to get slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to
avoid that penalty on the synchronous route. Asynchronous execution
already adds a lot of code, so this doesn't add significant
degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
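For reference, unlikely() here is core's branch-hint macro from
src/include/c.h; it looks roughly as below for GCC-compatible
compilers (a sketch, the exact guards in core may differ):

#if defined(__GNUC__)
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif

This lets the compiler lay the async-only branches out off the hot
path, so an Append with no async-capable children pays almost nothing
for the new code.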
0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 252216d9348e6d32894d15732f6991a3c770baf3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution
Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
I noticed that this patch conflicts with 665d1fa (Logical
replication), so I rebased it. Only executor/Makefile
conflicted.
At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>
This is a PoC patch set for the asynchronous execution feature, based
on the executor infrastructure Robert proposed. These patches are
rebased onto the current master.

0001-robert-s-2nd-framework.patch
Robert's executor async infrastructure. Async-driver nodes register
their async-capable children, and synchronization and data transfer
are done out of band of the ordinary ExecProcNode channel, so async
execution no longer disturbs async-unaware nodes or slows them down.
(A condensed sketch of the requestor-side pattern appears after the
measurement numbers below.)

0002-Fix-some-bugs.patch
Some fixes to make 0001 work. This is kept separate just to preserve
the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch
The original infrastructure doesn't work when multiple foreign tables
are on the same connection. This makes that case work.

0004-Make-postgres_fdw-async-capable.patch
Makes postgres_fdw work asynchronously.
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can be leaked on error.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
ExecAppend gets a bit slower due to branch-misprediction penalties.
This fixes that by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch
As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special
treatment to show sane results. This patch attempts that.

A result of a performance measurement is in this message:
/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp
| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicated connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within the margin of error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster
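For readers skimming the patches, here is a condensed sketch of the
requestor-side pattern that 0001 introduces. The function name is
made up; the fields and calls are from the patches; the synchronous
fallback and error paths are omitted:

/*
 * Sketch: how an Append-like requestor drives async children.
 * Condensed from ExecAppend in 0001; illustrative only.
 */
static TupleTableSlot *
AppendConsumeAsync(AppendState *node)
{
	EState *estate = node->ps.state;
	int		i;

	/* Fire one asynchronous tuple request per child that needs one. */
	while ((i = bms_first_member(node->as_needrequest)) >= 0)
	{
		ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
		node->as_nasyncpending++;
	}

	/*
	 * Run the event loop until some child delivers a result through our
	 * ExecAsyncResponse callback (ExecAsyncAppendResponse).  A timeout of
	 * -1 blocks; 0 would only poll, which is what ExecAppend uses while
	 * synchronous children can still make progress.
	 */
	while (node->as_nasyncresult == 0 && node->as_nasyncpending > 0)
		ExecAsyncEventLoop(estate, &node->ps, -1);

	/* The response callback stashed any delivered tuples here. */
	if (node->as_nasyncresult > 0)
		return node->as_asyncresult[--node->as_nasyncresult];

	return NULL;				/* all async children are exhausted */
}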
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 5ef5ae125e758f221dcacbb1391ba3a517ec0a9f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..595a47e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4440,6 +4455,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execReplication.o execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ /* Recompute the remaining timeout from the elapsed time. */
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..e61218a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async requests need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one right away because it might complete
+ * immediately. Note that bms_add_member may repalloc, so store the
+ * result back.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..61899d1 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 30d733e..a8cabdf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1560ac3..a894a9d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index dcfa6ee..67439ec 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fae1f67..968f8be 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -272,6 +272,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -961,8 +962,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -997,7 +1000,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1007,7 +1017,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5009,7 +5019,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5019,6 +5029,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6330,3 +6341,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..5a61306 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..4c50f1e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..29f3d7c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -430,6 +449,31 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1175,17 +1219,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..f0daada 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 4675717734d12d404b1d66a734866b3f85830244 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 3a09280..d7420e0 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6402,120 +6402,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6543,26 +6543,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6570,19 +6570,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6749,8 +6749,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 595a47e..f180838 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4472,7 +4473,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..af59f51 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3398,6 +3398,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..7769d3c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -787,7 +787,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
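One small pattern worth calling out in 0002: adding a wait event means extending the WaitEventIPC enum and teaching pgstat_get_wait_ipc() the matching display name, with no default case in the switch so the compiler warns if the two drift apart. Here is a compilable toy version of that pattern; the enum members and display names come from the patch, everything else is illustrative.

#include <stdio.h>

/* Toy copy of the two pieces 0002 keeps in lockstep. */
typedef enum WaitEventIPC
{
    WAIT_EVENT_SYNC_REP,
    WAIT_EVENT_ASYNC_WAIT       /* the new entry */
} WaitEventIPC;

static const char *
wait_ipc_name(WaitEventIPC w)
{
    switch (w)
    {
        case WAIT_EVENT_SYNC_REP:
            return "SyncRep";
        case WAIT_EVENT_ASYNC_WAIT:
            return "AsyncExecWait";
            /* no default case, so that compiler will warn */
    }
    return "unknown";           /* unreachable if the enum is covered */
}

int
main(void)
{
    printf("%s\n", wait_ipc_name(WAIT_EVENT_ASYNC_WAIT));
    return 0;
}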
Attachment: 0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 60a9ba9e74666dba290f6bf27225384966d272a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d7420e0..fd8b628 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6402,13 +6402,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6416,10 +6416,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6439,13 +6439,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6453,10 +6453,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6477,22 +6477,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6500,16 +6500,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6543,8 +6543,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6561,8 +6561,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6570,8 +6570,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f180838..abb256b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4477,11 +4477,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * The request function marks the request ASYNC_COMPLETE when a
+ * result is immediately available.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is available, complete it immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Dispatch the callback if one is pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for, which can happen when all pending requests
+ * complete before any wait event is registered.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to set at least one event per requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e61218a..568fa25 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Move on to the sync nodes if any remain. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 61899d1..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -376,7 +376,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -384,7 +384,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8cabdf..c62aaf2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -237,6 +237,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a894a9d..c2e34a8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -370,6 +370,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67439ec..9837eff 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1540,6 +1540,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 968f8be..a9164ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -966,6 +967,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used when
+ * deparsing tlist entries in EXPLAIN (see set_deparse_planstate). Since
+ * async subplans are moved to the head of the subplan list, record where
+ * that first child ends up so EXPLAIN can still refer to it.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1005,9 +1015,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1017,7 +1031,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5019,7 +5034,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5030,6 +5045,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index f26175e..37fc817 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4242,7 +4242,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 5a61306..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 4c50f1e..41fc76f 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 29f3d7c..9b43fd6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -357,6 +357,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -365,8 +372,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f0daada..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -229,6 +229,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
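To summarize the lifecycle change in 0003: the callback_pending/request_complete booleans are folded into a single AsyncRequestState, which makes the legal transitions explicit: IDLE -> WAITING -> CALLBACK_PENDING -> COMPLETE, with a shortcut straight from IDLE to COMPLETE when the request function can answer immediately. The following is a standalone sketch of that state machine; the transition checks mirror the Asserts in execAsync.c, and the driver in main() is purely illustrative.

#include <assert.h>
#include <stdio.h>

typedef enum AsyncRequestState
{
    ASYNC_IDLE,                 /* request issued, result may come at once */
    ASYNC_WAITING,              /* wait events registered, nothing ready */
    ASYNC_CALLBACK_PENDING,     /* an event fired, notify callback is due */
    ASYNC_COMPLETE              /* result valid, requestor may consume it */
} AsyncRequestState;

/* ExecAsyncSetRequiredEvents: a request starts waiting on events. */
static AsyncRequestState
events_registered(AsyncRequestState s)
{
    assert(s == ASYNC_IDLE);
    return ASYNC_WAITING;
}

/* ExecAsyncEventWait: a registered event fired. */
static AsyncRequestState
event_fired(AsyncRequestState s)
{
    assert(s == ASYNC_WAITING);
    return ASYNC_CALLBACK_PENDING;
}

/* ExecAsyncRequestDone: reachable straight from IDLE (immediate result)
 * or once the notify callback has run. */
static AsyncRequestState
request_done(AsyncRequestState s)
{
    assert(s == ASYNC_IDLE || s == ASYNC_CALLBACK_PENDING);
    return ASYNC_COMPLETE;
}

int
main(void)
{
    AsyncRequestState s = ASYNC_IDLE;

    s = events_registered(s);
    s = event_fired(s);
    s = request_done(s);
    printf("final state: %d (ASYNC_COMPLETE)\n", s);
    return 0;
}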
Attachment: 0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 28025ab53215ce0cbbfc690bf053e600afa9fb4d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating it
+ * (initsize bytes, zeroed) if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index fd8b628..5d448d1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6431,9 +6431,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6468,9 +6468,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6733,27 +6733,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index abb256b..a52d54a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready; /* false if the caller should wait and retry */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1335,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1396,32 +1439,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If other nodes are waiting for this connection, let the first
+ * waiter be its next owner.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting on it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Someone else is holding this connection. Add myself to the tail
+ * of the waiters' list, then return not-ready. To avoid scanning
+ * through the waiters' list, the current owner maintains a
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node in the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node in the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1437,7 +1574,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1445,6 +1582,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1473,9 +1613,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1493,7 +1633,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1501,16 +1641,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Tear down asynchrony state and clean up leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1712,7 +1868,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1791,6 +1949,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1801,14 +1961,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1816,10 +1976,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1857,6 +2017,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1877,14 +2039,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1892,10 +2054,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1933,6 +2095,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1953,14 +2117,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1968,10 +2132,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2018,16 +2182,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2307,7 +2471,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2360,7 +2526,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2407,8 +2576,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2527,6 +2696,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2570,6 +2740,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2924,11 +3104,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2994,47 +3174,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Retrieve the rows of a result that has arrived for the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuple is remaining
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3044,27 +3273,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3148,7 +3432,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3158,12 +3442,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3171,9 +3455,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3304,9 +3588,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3314,10 +3598,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4463,8 +4747,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4473,22 +4759,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4848,7 +5171,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index e19a3ef..3ae12bc 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 0dd95c6..1cba31e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -820,6 +821,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 41fc76f..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patchtext/x-patch; charset=us-asciiDownload
From 991c5a4a14a841123237cd370fc1ec4756fad352 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking
Wait event sets created for async execution can live for several
iterations, so they leak in case of an error during those iterations.
This commit uses resource owners to prevent such leaks.
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d45a41d..3b64e83 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need resowner for event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..34c7e37 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
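[ A usage note on the 0005 patch above: once CreateWaitEventSet registers the
set with CurrentResourceOwner, callers no longer need their own PG_TRY blocks
to guarantee cleanup. A minimal sketch of the resulting pattern follows; the
caller function is illustrative, only the latch.c/resowner.c calls are real. ]

#include "storage/latch.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

static void
wait_example(void)
{
	WaitEventSet *set;

	/*
	 * With 0005 applied, CreateWaitEventSet remembers the set in
	 * CurrentResourceOwner (when one is set), so an elog(ERROR) raised
	 * anywhere before FreeWaitEventSet no longer leaks it: the resource
	 * owner's release phase frees it on our behalf.
	 */
	set = CreateWaitEventSet(CurrentMemoryContext, 3);

	/* ... AddWaitEventToSet(), WaitEventSetWait(), possibly elog(ERROR) ... */

	FreeWaitEventSet(set);		/* unregisters via ResourceOwnerForgetWES */
}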
0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patchtext/x-patch; charset=us-asciiDownload
From 01abb362be9f30dfe324d5d05a0717d375c3fc57 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the misprediction penalty of
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing (synchronous) route. Asynchronous execution
already carries a lot of additional code, so this doesn't add significant
degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 568fa25..9c07b49 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
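[ For readers unfamiliar with the hint used above: unlikely() in PostgreSQL's
c.h is, to the best of my knowledge, a thin wrapper around the GCC builtin and
a plain expression elsewhere, so the annotation only steers code layout and
has no semantic effect. Paraphrased from memory; check src/include/c.h: ]

#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif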
0007-Add-instrumentation-to-async-execution.patchtext/x-patch; charset=us-asciiDownload
From 7939f913ee610ece749fa4c5acacb0301308f503 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution
Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
The patches still apply, moved to CF 2017-03. Be aware of that:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+ PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+ bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you.
At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS0MhZrzgMVQeFEnnKABcsMnNULd8=O0PG7_h-FUp5aEQ@mail.gmail.com>
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
The patches still apply, moved to CF 2017-03. Be aware of that:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+ PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+ bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.
Thank you for letting me know the command. I changed my check
scripts to use them and it seems working fine on both commit and
rebase.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----
if ((w->events & WL_LATCH_SET) != 0)
{
+ ResetLatch(MyLatch);
process_latch_set = true;
continue;
}
Actually _almost_ fixed because at some point one of the following
Assert(areq->state == ASYNC_WAITING);
statements fired. I think it was the immediately following one, but I can
imagine the same to happen in the branch
if (process_latch_set)
...
I think the wants_process_latch field of PendingAsyncRequest is not useful
alone because the process latch can be set for reasons completely unrelated to
the asynchronous processing. If the asynchronous node should use the latch to
signal its readiness, I think an additional flag is needed in the request
which tells ExecAsyncEventWait that the latch was set by the asynchronous
node.
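[ To illustrate the point, a sketch of such a flag follows; only
wants_process_latch appears in the posted patches, the second field and its
name are hypothetical. ]

#include <stdbool.h>

typedef struct PendingAsyncRequestSketch
{
	bool		wants_process_latch;	/* existing: node is latch-driven */
	bool		latch_set_by_node;		/* hypothetical: true only when the
										 * async node itself set the latch,
										 * so ExecAsyncEventWait can ignore
										 * latch sets that are unrelated to
										 * async processing */
} PendingAsyncRequestSketch;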
BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the
async node either changing ASYNC_WAITING directly to ASYNC_COMPLETE, or
leaving it ASYNC_WAITING if the data is not ready.
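[ To make the suggestion concrete, a sketch of the reduced state set; the
value names come from the patch series, and dropping ASYNC_CALLBACK_PENDING is
only the reviewer's proposal, not something the patches do. ]

typedef enum AsyncRequestState
{
	ASYNC_WAITING,				/* request issued, result not ready yet */
	ASYNC_COMPLETE				/* result ready; set directly by the node */
} AsyncRequestState;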
In addition, the following comments are based only on code review, I didn't
verify my understanding experimentally:
* Isn't it possible for AppendState.as_asyncresult to contain multiple
responses from the same async node? Since the array stores TupleTableSlot
instead of the actual tuple (so multiple items of as_asyncresult point to
the same slot), I suspect the slot contents might not be defined when the
Append node eventually tries to return it to the upper plan. (One possible
mitigation is sketched after this list.)
* For the WaitEvent subsystem to work, I think postgres_fdw should keep a
separate libpq connection per node, not per user mapping. Currently the
connections are cached by user mapping, but it's legal to locate multiple
child postgres_fdw nodes of Append plan on the same remote server. I expect
that these "co-located" nodes would currently use the same user mapping and
therefore the same connection.
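[ Regarding the first item above, a hedged sketch of one mitigation: queue a
deep copy of the tuple rather than the reusable slot. ExecCopySlotTuple is the
existing executor helper; the wrapper function is illustrative only. ]

#include "executor/executor.h"

/*
 * Sketch: materialize an async result before queueing it, so a later
 * refill of the shared TupleTableSlot cannot clobber a queued response.
 */
static HeapTuple
materialize_async_result(TupleTableSlot *slot)
{
	/* deep copy into the current memory context */
	return ExecCopySlotTuple(slot);
}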
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houska <ah@cybertec.at> wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----
if ((w->events & WL_LATCH_SET) != 0)
{
+ ResetLatch(MyLatch);
process_latch_set = true;
continue;
}
Hi, I've been testing this patch because it seemed like it would help a use
case of mine, but I can't tell if it's currently working for cases other than
a local parent table that has many child partitions which happen to be
foreign tables. Is it? I was hoping to use it for a case like:
select x, sum(y) from one_remote_table
union all
select x, sum(y) from another_remote_table
union all
select x, sum(y) from a_third_remote_table
but while aggregates do appear to be pushed down, it seems that the remote
tables are being queried in sequence. Am I doing something wrong?
Horiguchi-san,
On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
With the latest set of patches, I observe a crash due to an Assert failure:
#0 0x0000003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003969633e05 in *__GI_abort () at abort.c:92
#2 0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",
lineNumber=345) at assert.c:54
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4 0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5 0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6 0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7 0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8 0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9 0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>
I was running a query whose plan looked like:
explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>
The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).
There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:
ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010
The backtrace in this case looks like below:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
There is a segfault when accessing the events variable, whose members seem
to be pfreed:
(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
nevents = 2139062143,
nevents_space = 2139062143,
resowner = 0x7f7f7f7f7f7f7f7f,
events = 0x7f7f7f7f7f7f7f7f,
latch = 0x7f7f7f7f7f7f7f7f,
latch_pos = 2139062143,
epoll_fd = 2139062143,
epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you very much for testing this!
At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp>
Horiguchi-san,
On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.
With the latest set of patches, I observe a crash due to an Assert failure:
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
This means that no pending FDW scan let itself go to the waiting stage,
which gets the whole thing stuck. This happens if no one is actually
waiting for a result. I suppose that all of the foreign scans ran on the
same connection. In any case it must be a mistake in the state
transition. I'll look into it.
I was running a query whose plan looked like:
explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>
The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).
Yeah, it seems to me unrelated to how many there are.
There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:
ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010
The backtrace in this case looks like below:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
There is a segfault when accessing the events variable, whose members seem
to be pfreed:
(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
nevents = 2139062143,
nevents_space = 2139062143,
resowner = 0x7f7f7f7f7f7f7f7f,
events = 0x7f7f7f7f7f7f7f7f,
latch = 0x7f7f7f7f7f7f7f7f,
latch_pos = 2139062143,
epoll_fd = 2139062143,
epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}
Mmm, I reproduced it quite easily. A silly bug.
Something bad is happening between freeing the ExecutorState memory
context and releasing the resource owner. Perhaps the ExecutorState is
freed by the resowner (as a part of its ancestors) before the memory for
the WaitEventSet is freed. It was careless of me. I'll reconsider it.
Many thanks for the report.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At Thu, 16 Feb 2017 21:06:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170216.210600.214980879.horiguchi.kyotaro@lab.ntt.co.jp>
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
This means that no pending FDW scan let itself go to the waiting stage,
which gets the whole thing stuck. This happens if no one is actually
waiting for a result. I suppose that all of the foreign scans ran on the
same connection. In any case it must be a mistake in the state
transition. I'll look into it.
...
I was running a query whose plan looked like:
explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>
The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).
Yeah, it seems to me unrelated to how many there are.
Finally, I couldn't reproduce the crash for the (maybe) same case. I
can guess two reasons for it. One is a situation where
node->as_nasyncpending differs from estate->es_num_pending_async,
but I couldn't find how that could happen. Another is a situation in
postgresIterateForeignScan where the "next owner" has reached EOF but
another waiter has not. I haven't reproduced that situation but have
fixed the code for that case. In addition, I found a bug in
ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate
way.
Mmm, I reproduced it quite easily. A silly bug.
Something bad is happening between freeing the ExecutorState memory
context and releasing the resource owner. Perhaps the ExecutorState is
freed by the resowner (as a part of its ancestors) before the memory for
the WaitEventSet is freed. It was careless of me. I'll reconsider it.
The cause was that the WaitEventSet was placed in ExecutorState
but registered to TopTransactionResourceOwner. I fixed it.
These fixes are made on top of the previous patches for now. In
the attached files, 0008 and 0009 are for the second bug, 0012 is
for the first bug, and 0013 is for the bms bug.
Sorry for the confusing patches; I will resend neater ones
soon.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0013-Fix-a-bug-of-a-usage-of-bms_add_member.patchtext/x-patch; charset=us-asciiDownload
From 995f2133c9cb651de46d8c9506537f72e0546b82 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:30:12 +0900
Subject: [PATCH 13/13] Fix a bug of a usage of bms_add_member.
bms_add_member may change the location of the struct, so reassigning
the result is mandatory. Forgetting to do so can cause a bug with more
than 32 members.
---
src/backend/executor/nodeAppend.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 9293139..109435d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -428,5 +428,6 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
* Mark the node that returned a result as ready for a new request. We
* don't launch another one here immediately because it might complete
*/
- bms_add_member(node->as_needrequest, areq->request_index);
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
}
--
2.9.2
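[ The general rule this fix illustrates: Bitmapset mutators such as
bms_add_member may repalloc and move the set once it outgrows one word, so the
returned pointer must always be kept. A trivial sketch; build_set is
illustrative, not a function from the patches. ]

#include "nodes/bitmapset.h"

static Bitmapset *
build_set(void)
{
	Bitmapset  *bms = NULL;
	int			i;

	for (i = 0; i < 100; i++)
		bms = bms_add_member(bms, i);	/* reassign: the set may move */

	/* Calling bms_add_member without keeping the result only appears to
	 * work while the set still fits in one word, hence ">32 members". */
	return bms;
}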
0012-Fix-a-possible-bug.patchtext/x-patch; charset=us-asciiDownload
From fab48a75f2bdb89ba96068938da759b6e67682c2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:28:49 +0900
Subject: [PATCH 12/13] Fix a possible bug.
I haven't actually observed it, but calling postgresIterateForeignScan
on an EOF'ed node can cause a crash. This may fix it.
---
contrib/postgres_fdw/postgres_fdw.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 04f520b..6b694d0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1449,6 +1449,16 @@ postgresIterateForeignScan(ForeignScanState *node)
{
ForeignScanState *next_conn_owner = node;
+ /*
+ * This can be called for an EOF'ed node; in that case do nothing other
+ * than returning a null tuple.
+ */
+ if (GetPgFdwScanState(node)->eof_reached)
+ {
+ fsstate->result_ready = true;
+ return ExecClearTuple(slot);
+ }
+
/* This node has sent a query on this connection */
if (fsstate->s.connpriv->current_owner == node)
{
--
2.9.2
0011-Some-non-functional-fixes.patchtext/x-patch; charset=us-asciiDownload
From 98e271051e93a1a10d0b4f45939f18e6cbe01367 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:23:42 +0900
Subject: [PATCH 11/13] Some non-functional fixes.
Rename items of AsyncRequestState, rewrite some comments, and rename
some struct members for readability.
---
contrib/postgres_fdw/postgres_fdw.c | 66 +++++++++++++++++++------------------
src/backend/executor/execAsync.c | 62 ++++++++++++++++------------------
src/backend/executor/nodeAppend.c | 4 +--
src/include/nodes/execnodes.h | 8 ++---
4 files changed, 69 insertions(+), 71 deletions(-)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a52d54a..04f520b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -129,17 +129,17 @@ enum FdwDirectModifyPrivateIndex
/*
* Connection private area structure.
*/
- typedef struct PgFdwConnspecate
+typedef struct PgFdwConnpriv
{
ForeignScanState *current_owner; /* The node currently running a query
* on this connection*/
-} PgFdwConnspecate;
+} PgFdwConnpriv;
/* Execution state base type */
typedef struct PgFdwState
{
PGconn *conn; /* connection for the scan */
- PgFdwConnspecate *connspec; /* connection private memory */
+ PgFdwConnpriv *connpriv; /* connection private memory */
} PgFdwState;
/*
@@ -385,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -1370,9 +1370,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* establish new connection if necessary.
*/
fsstate->s.conn = GetConnection(user, false);
- fsstate->s.connspec = (PgFdwConnspecate *)
- GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
- fsstate->s.connspec->current_owner = NULL;
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
fsstate->waiter = NULL;
fsstate->last_waiter = node;
@@ -1450,7 +1450,7 @@ postgresIterateForeignScan(ForeignScanState *node)
ForeignScanState *next_conn_owner = node;
/* This node has sent a query on this connection */
- if (fsstate->s.connspec->current_owner == node)
+ if (fsstate->s.connpriv->current_owner == node)
{
/* Check if the result is available */
if (PQisBusy(fsstate->s.conn))
@@ -1498,7 +1498,7 @@ postgresIterateForeignScan(ForeignScanState *node)
fsstate->last_waiter = node;
}
}
- else if (fsstate->s.connspec->current_owner)
+ else if (fsstate->s.connpriv->current_owner)
{
/*
* Anyone else is holding this connection. Add myself to the tail
@@ -1507,7 +1507,7 @@ postgresIterateForeignScan(ForeignScanState *node)
* shortcut to the last waiter.
*/
PgFdwScanState *conn_owner_state =
- GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
ForeignScanState *last_waiter = conn_owner_state->last_waiter;
PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
@@ -1523,11 +1523,13 @@ postgresIterateForeignScan(ForeignScanState *node)
return ExecClearTuple(slot);
}
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
/*
* Send the next request for the next owner of this connection if
* needed.
*/
-
if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
{
PgFdwScanState *next_owner_state =
@@ -1869,8 +1871,8 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
/* Open connection; report that we'll create a prepared statement. */
fmstate->s.conn = GetConnection(user, true);
- fmstate->s.connspec = (PgFdwConnspecate *)
- GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -2472,8 +2474,8 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* establish new connection if necessary.
*/
dmstate->s.conn = GetConnection(user, false);
- dmstate->s.connspec = (PgFdwConnspecate *)
- GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2696,7 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
- PgFdwConnspecate *connspec;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2740,13 +2742,13 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
- connspec = GetConnectionSpecificStorage(fpinfo->user,
- sizeof(PgFdwConnspecate));
- if (connspec)
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
{
PgFdwState tmpstate;
tmpstate.conn = conn;
- tmpstate.connspec = connspec;
+ tmpstate.connpriv = connpriv;
vacate_connection(&tmpstate);
}
@@ -3181,7 +3183,7 @@ request_more_data(ForeignScanState *node)
char sql[64];
/* The connection should be vacant */
- Assert(fsstate->s.connspec->current_owner == NULL);
+ Assert(fsstate->s.connpriv->current_owner == NULL);
/*
* If this is the first call after Begin or ReScan, we need to create the
@@ -3196,7 +3198,7 @@ request_more_data(ForeignScanState *node)
if (!PQsendQuery(conn, sql))
pgfdw_report_error(ERROR, NULL, conn, false, sql);
- fsstate->s.connspec->current_owner = node;
+ fsstate->s.connpriv->current_owner = node;
}
/*
@@ -3210,7 +3212,7 @@ fetch_received_data(ForeignScanState *node)
MemoryContext oldcontext;
/* I should be the current connection owner */
- Assert(fsstate->s.connspec->current_owner == node);
+ Assert(fsstate->s.connpriv->current_owner == node);
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
@@ -3287,14 +3289,14 @@ fetch_received_data(ForeignScanState *node)
}
PG_CATCH();
{
- fsstate->s.connspec->current_owner = NULL;
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
- fsstate->s.connspec->current_owner = NULL;
+ fsstate->s.connpriv->current_owner = NULL;
MemoryContextSwitchTo(oldcontext);
}
@@ -3305,16 +3307,16 @@ fetch_received_data(ForeignScanState *node)
static void
vacate_connection(PgFdwState *fdwstate)
{
- PgFdwConnspecate *connspec = fdwstate->connspec;
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
ForeignScanState *owner;
- if (connspec == NULL || connspec->current_owner == NULL)
+ if (connpriv == NULL || connpriv->current_owner == NULL)
return;
/*
* let the current connection owner read the result for the running query
*/
- owner = connspec->current_owner;
+ owner = connpriv->current_owner;
fetch_received_data(owner);
/* Clear the waiting list */
@@ -3335,7 +3337,7 @@ static void
absorb_current_result(ForeignScanState *node)
{
PgFdwScanState *fsstate = GetPgFdwScanState(node);
- ForeignScanState *owner = fsstate->s.connspec->current_owner;
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
if (owner)
{
@@ -3344,7 +3346,7 @@ absorb_current_result(ForeignScanState *node)
while(PQisBusy(conn))
PQclear(PQgetResult(conn));
- fsstate->s.connspec->current_owner = NULL;
+ fsstate->s.connpriv->current_owner = NULL;
fsstate->async_waiting = false;
}
}
@@ -4785,7 +4787,7 @@ postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
if (!reinit)
return true;
- if (fsstate->s.connspec->current_owner == node)
+ if (fsstate->s.connpriv->current_owner == node)
{
AddWaitEventToSet(estate->es_wait_event_set,
WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index a8e5f80..03ab811 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -99,14 +99,12 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /*
- * Give the requestee a chance to do whatever it wants.
- * Requst functions return true if a result is immediately available.
- */
+ /* Give the requestee a chance to do whatever it wants. */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -118,10 +116,8 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
(int) nodeTag(requestee));
}
- /*
- * If a result is available, complete it immediately.
- */
- if (areq->state == ASYNC_COMPLETE)
+ /* If a result is available, complete it immediately */
+ if (areq->state == ASYNCREQ_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
@@ -178,15 +174,16 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
CHECK_FOR_INTERRUPTS();
/*
- * Check for events, but don't block if there notifications that
- * have not been delivered yet.
+ * Check for events, but don't block if any undelivered notification
+ * remains; process such notifications immediately.
*/
if (estate->es_async_callback_pending > 0)
ExecAsyncEventWait(estate, 0);
else if (!ExecAsyncEventWait(estate, cur_timeout))
cur_timeout = 0; /* Timeout was reached. */
- else
+ else if (timeout > 0)
{
+ /* Exited before timeout. Calculate the remaining time. */
instr_time cur_time;
long cur_timeout = -1;
@@ -205,19 +202,15 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
- /* Skip it if not pending. */
- if (areq->state == ASYNC_CALLBACK_PENDING)
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
{
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
}
- if (areq->state == ASYNC_COMPLETE)
+ /* Deliver the acquired tuple to the requester */
+ if (areq->state == ASYNCREQ_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -248,7 +241,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (tail->state == ASYNC_COMPLETE)
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -324,7 +319,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- if (areq->num_fd_events > 0)
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
@@ -358,9 +353,9 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- Assert(areq->state == ASYNC_WAITING);
+ Assert(areq->state == ASYNCREQ_WAITING);
- areq->state = ASYNC_CALLBACK_PENDING;
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
estate->es_async_callback_pending++;
}
}
@@ -377,8 +372,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(areq->state == ASYNC_WAITING);
- areq->state = ASYNC_CALLBACK_PENDING;
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
}
}
}
@@ -453,11 +448,11 @@ ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
/*
* An executor node should call this function to signal that it needs to wait
* on one or more file descriptor events that can be registered on a
- * WaitEventSet, and possibly also on the process latch. num_fd_events
- * should be the maximum number of file descriptor events that it will wish to
- * register. force_reset should be true if the node can't reuse the
- * WaitEventSet it most recently initialized, for example because it needs to
- * drop a wait event from the set.
+ * WaitEventSet, and possibly also on process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
*/
void
ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
@@ -467,7 +462,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
- areq->state = ASYNC_WAITING;
+ areq->state = ASYNCREQ_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -497,12 +492,13 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
- Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->state = ASYNC_COMPLETE;
+ areq->state = ASYNCREQ_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 9c07b49..9293139 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -403,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->state == ASYNC_COMPLETE);
+ Assert(areq->state == ASYNCREQ_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
@@ -420,7 +420,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
if (TupIsNull(slot))
return;
- /* Save result so we can return it. */
+ /* Set the next tuple from this requestee. */
Assert(node->as_nasyncresult < node->as_nasyncplans);
node->as_asyncresult[node->as_nasyncresult++] = slot;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5afcd34..7a62eff 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -363,10 +363,10 @@ typedef struct ResultRelInfo
*/
typedef enum AsyncRequestState
{
- ASYNC_IDLE,
- ASYNC_WAITING,
- ASYNC_CALLBACK_PENDING,
- ASYNC_COMPLETE
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
} AsyncRequestState;
typedef struct PendingAsyncRequest
{
--
2.9.2
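For orientation, the renamed values form a small per-request state machine. The transitions below are read directly off this patch series (ExecAsyncRequest, ExecAsyncSetRequiredEvents, ExecAsyncEventWait, ExecAsyncRequestDone); this is a summary, not new code:

    /*
     * ASYNCREQ_IDLE              nothing requested yet
     *       |  ExecAsyncSetRequiredEvents(): wait events registered
     *       v
     * ASYNCREQ_WAITING           waiting for fd events / process latch
     *       |  ExecAsyncEventWait(): an awaited event fired
     *       v
     * ASYNCREQ_CALLBACK_PENDING  ExecAsyncNotify() still to be called
     *       |  ExecAsyncRequestDone(): result stored in areq->result
     *       v
     * ASYNCREQ_COMPLETE          result available to the requestor
     *
     * A request that completes immediately goes straight from
     * ASYNCREQ_IDLE to ASYNCREQ_COMPLETE.
     */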
Attachment: 0010-Fix-a-typo-of-mcxt.c.patch (text/x-patch; charset=us-ascii)
From 886e0bbf1e7742c9c34582cccb5a05575420555c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:14:15 +0900
Subject: [PATCH 10/13] Fix a typo of mcxt.c
---
src/backend/utils/mmgr/mcxt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 6ad0bb4..2e74e29 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -208,7 +208,7 @@ MemoryContextDelete(MemoryContext context)
MemoryContextDeleteChildren(context);
/*
- * It's not entirely clear whether 'tis better to do this before or after
+ * It's not entirely clear whether it's better to do this before or after
* delinking the context; but an error in a callback will likely result in
* leaking the whole context (if it's not a root context) if we do it
* after, so let's do it before.
--
2.9.2
Attachment: 0009-Fix-the-resource-owner-to-be-used.patch (text/x-patch; charset=us-ascii)
From cdad8b09c0e66507d65b1c8db552923e89d23294 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:12:40 +0900
Subject: [PATCH 09/13] Fix the resource owner to be used
Fixup of previous commit.
---
src/backend/executor/execAsync.c | 28 +++++++---------------------
1 file changed, 7 insertions(+), 21 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 588ba18..a8e5f80 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,7 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
-#include "utils/resowner_private.h"
+#include "utils/memutils.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -297,8 +297,6 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
- ResourceOwner savedOwner;
-
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -308,26 +306,14 @@ ExecAsyncEventWait(EState *estate, long timeout)
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
/*
- * The wait event set created here should be released in case of
- * error.
+ * The wait event set created here must survive beyond the ExecutorState
+ * context, but still be released in case of error.
*/
- savedOwner = CurrentResourceOwner;
- CurrentResourceOwner = TopTransactionResourceOwner;
-
- PG_TRY();
- {
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
- }
- PG_CATCH();
- {
- CurrentResourceOwner = savedOwner;
- PG_RE_THROW();
- }
- PG_END_TRY();
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
- CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
--
2.9.2
Attachment: 0008-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 7b85a878ddef06b9dda1608eed318c176d43b575 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 08/13] Allow wait event set to be registered to resource owner
A WaitEventSet may have to be released via a resource owner. This change
allows the creator of a WaitEventSet to specify one.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 14 ++++++++------
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 1 -
src/include/storage/latch.h | 4 +++-
5 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7939b1f..16a5d7a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 30dc77b..da2c41d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -331,7 +331,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
/* This function doesn't need resowner for event set */
CurrentResourceOwner = NULL;
- set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
@@ -490,14 +490,14 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
- if (CurrentResourceOwner)
- ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
@@ -558,9 +558,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
- set->resowner = CurrentResourceOwner;
- if (CurrentResourceOwner)
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 34c7e37..d497216 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1329,4 +1329,3 @@ PrintWESLeakWarning(WaitEventSet *events)
elog(WARNING, "wait event set leak: %p still referenced",
events);
}
-
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
--
2.9.2
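With the new parameter the caller decides whether the set is tracked: passing NULL keeps the old untracked behavior (as the pqcomm.c and condition_variable.c call sites above do), while passing an owner ties cleanup to that owner. A minimal usage sketch, assuming only the signature introduced by this patch:

    #include "postgres.h"
    #include "storage/latch.h"
    #include "utils/memutils.h"
    #include "utils/resowner.h"

    static void
    wait_event_set_examples(void)
    {
        /* Untracked, as before this patch: the caller must not leak it. */
        WaitEventSet *plain = CreateWaitEventSet(TopMemoryContext, NULL, 3);

        /*
         * Tracked: if an error is thrown before FreeWaitEventSet() runs,
         * resource-owner cleanup frees the set (warning only on commit).
         */
        WaitEventSet *owned = CreateWaitEventSet(CurrentMemoryContext,
                                                 CurrentResourceOwner, 3);

        FreeWaitEventSet(owned);        /* also forgets it from its owner */
        FreeWaitEventSet(plain);
    }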
Attachment: 0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 50e0e4ba3b495b85de95b2261d248679ffeb40f2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 07/13] Add instrumentation to async execution
Make EXPLAIN ANALYZE give a sane result when async execution has taken
place.
---
src/backend/executor/execAsync.c | 19 +++++++++++++++++++
src/backend/executor/instrument.c | 2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PendingAsyncRequest *areq = NULL;
int nasync = estate->es_num_pending_async;
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
/*
* If the number of pending asynchronous nodes exceeds the number of
* available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
if (areq->state == ASYNC_COMPLETE)
{
Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
ExecAsyncResponse(estate, areq);
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
return;
}
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
/* No result available now, make this node pending */
estate->es_num_pending_async++;
}
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
/* Skip it if not pending. */
if (areq->state == ASYNC_CALLBACK_PENDING)
{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
if (requestor == areq->requestor)
requestor_done = true;
ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
}
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
}
/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
--
2.9.2
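The pattern in this patch is to bracket every visit to a requestee with InstrStartNode()/InstrStopNode(), reporting one tuple only when a tuple actually arrived; the instrument.c change then keeps a zero-tuple stop from starting the "first tuple" timing cycle. A condensed sketch of the bracket (not the exact hunks above; areq and estate as in execAsync.c, with the pre-rename state names this patch still uses):

    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);

    ExecAsyncNotify(estate, areq);  /* may or may not complete the request */

    if (areq->requestee->instrument)
        InstrStopNode(areq->requestee->instrument,
                      (areq->state == ASYNC_COMPLETE &&
                       !TupIsNull((TupleTableSlot *) areq->result)) ? 1.0 : 0.0);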
Attachment: 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From a7f4a6833c6eabae9c66ed1c99f15948ef91c59f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 06/13] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the branch-misprediction penalty of
the async-execution checks. Apply unlikely() to them to avoid that penalty
on the existing synchronous route. Asynchronous execution already involves
a lot of additional code, so this doesn't add significant degradation
there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 568fa25..9c07b49 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
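unlikely() compiles down to __builtin_expect() on GCC-compatible compilers, which mostly affects code layout: the cold async branch gets moved out of the straight-line synchronous path. A self-contained illustration of the idiom; the macro shown is the usual definition (the backend already has one in c.h, so outside it you would define your own):

    #include <stdio.h>

    #if defined(__GNUC__)
    #define unlikely(x) __builtin_expect((x) != 0, 0)
    #else
    #define unlikely(x) (x)
    #endif

    static int
    append_step(int nasyncplans)
    {
        if (unlikely(nasyncplans > 0))
            return 1;               /* cold path: async bookkeeping */
        return 0;                   /* hot path: plain synchronous append */
    }

    int
    main(void)
    {
        printf("%d\n", append_step(0));
        return 0;
    }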
Attachment: 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 436b22a547e66875480ad8a151ba9f1fe239dd8c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 05/13] Use resource owner to prevent wait event set from
leaking
Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those iterations.
This commit uses a resource owner to prevent such leaks.
---
src/backend/executor/execAsync.c | 28 ++++++++++++++--
src/backend/storage/ipc/latch.c | 19 ++++++++++-
src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
src/include/utils/resowner_private.h | 8 +++++
4 files changed, 114 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
+#include "utils/resowner_private.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (estate->es_wait_event_set == NULL)
{
+ ResourceOwner savedOwner;
+
/*
* Allow for a few extra events without reinitializing. It
* doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
* of external FDs are likely to run afoul of kernel limits anyway.
*/
estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
- estate->es_wait_event_set =
- CreateWaitEventSet(estate->es_query_cxt,
- estate->es_allocated_fd_events + 1);
+
+ /*
+ * The wait event set created here should be released in case of
+ * error.
+ */
+ savedOwner = CurrentResourceOwner;
+ CurrentResourceOwner = TopTransactionResourceOwner;
+
+ PG_TRY();
+ {
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ }
+ PG_CATCH();
+ {
+ CurrentResourceOwner = savedOwner;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+
+ CurrentResourceOwner = savedOwner;
AddWaitEventToSet(estate->es_wait_event_set,
WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 0079ba5..30dc77b 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set;
+ ResourceOwner savedOwner = CurrentResourceOwner;
+
+ /* This function doesn't need resowner for event set */
+ CurrentResourceOwner = NULL;
+ set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ CurrentResourceOwner = savedOwner;
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
char *data;
Size sz = 0;
+ if (CurrentResourceOwner)
+ ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ set->resowner = CurrentResourceOwner;
+ if (CurrentResourceOwner)
+ ResourceOwnerRememberWES(set->resowner, set);
return set;
}
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..34c7e37 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /* XXXX: There's no property to identify a wait event set */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
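The resowner side of this patch follows the same three-call protocol used for files and DSM segments: enlarge the tracking array before acquiring the resource (so an out-of-memory failure can't strand an untracked resource), remember it after a successful acquire, forget it on normal release, and let owner cleanup free anything left over. A comment-level summary of the lifecycle as wired up above (still with the two-argument CreateWaitEventSet; patch 0008 later makes the owner explicit):

    /*
     * CreateWaitEventSet()            ResourceOwnerEnlargeWESs() before the
     *                                 allocation, ResourceOwnerRememberWES()
     *                                 after it succeeds
     * FreeWaitEventSet()              ResourceOwnerForgetWES()
     * error before FreeWaitEventSet() ResourceOwnerReleaseInternal() calls
     *                                 FreeWaitEventSet() itself, emitting
     *                                 PrintWESLeakWarning() only on commit
     */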
Attachment: 0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 72cd861c84a9d5bc214a58ce4c9052e52e2a2213 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 04/13] Make postgres_fdw async-capable
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 64 ++--
contrib/postgres_fdw/postgres_fdw.c | 483 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 4 +-
src/backend/executor/execProcnode.c | 9 +
src/include/foreign/fdwapi.h | 2 +
7 files changed, 510 insertions(+), 133 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. Allocate it
+ * with initsize if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 9180afe..bfa2211 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | aaaa
a | aaaaa
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | bbb
- b | bbbb
- b | bbbbb
a | aaa
a | zzzzzz
a | zzzzzz
+ b | bbb
+ b | bbbb
+ b | bbbbb
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | new
- b | new
- b | new
a | aaa
a | zzzzzz
a | zzzzzz
+ b | new
+ b | new
+ b | new
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- b | newtoo
- b | newtoo
- b | newtoo
a | newtoo
a | newtoo
a | newtoo
+ b | newtoo
+ b | newtoo
+ b | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6431,9 +6431,9 @@ select * from bar where f1 in (select f1 from foo) for update;
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6468,9 +6468,9 @@ select * from bar where f1 in (select f1 from foo) for share;
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
+ 1 | 11
3 | 33
4 | 44
- 1 | 11
2 | 22
(4 rows)
@@ -6733,27 +6733,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
- 2 | 322
1 | 311
- 6 | 266
+ 2 | 322
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index abb256b..a52d54a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+ typedef struct PgFdwConnspecate
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection*/
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnspecate *connspec; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
static bool postgresForeignAsyncConfigureWait(EState *estate,
- PendingAsyncRequest *areq,
- bool reinit);
+ PendingAsyncRequest *areq,
+ bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
PendingAsyncRequest *areq);
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1335,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1396,32 +1439,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * only the current owner is responsible to maintain the shortcut
+ * to the last waiter
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * for simplicity, last_waiter points itself on a node that no one
+ * is waiting for.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connspec->current_owner)
+ {
+ /*
+ * Anyone else is holding this connection. Add myself to the tail
+ * of the waiters' list then return not-ready. To avoid scanning
+ * through the waiters' list, the current owner is to maintain the
+ * shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connspec->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1437,7 +1574,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1445,6 +1582,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1473,9 +1613,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1493,7 +1633,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1501,16 +1641,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1712,7 +1868,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1791,6 +1949,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1801,14 +1961,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1816,10 +1976,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1857,6 +2017,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1877,14 +2039,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1892,10 +2054,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1933,6 +2095,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1953,14 +2117,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1968,10 +2132,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2018,16 +2182,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2307,7 +2471,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connspec = (PgFdwConnspecate *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2360,7 +2526,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2407,8 +2576,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2527,6 +2696,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnspecate *connspec;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2570,6 +2740,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connspec = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnspecate));
+ if (connspec)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connspec = connspec;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2924,11 +3104,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2994,47 +3174,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connspec->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Fetch some more rows from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* This node must be the current connection owner */
+ Assert(fsstate->s.connspec->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3044,27 +3273,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connspec->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connspec->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnspecate *connspec = fdwstate->connspec;
+ ForeignScanState *owner;
+
+ if (connspec == NULL || connspec->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connspec->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+ PGresult *res;
+
+ /* Drain all pending results; PQgetResult returns NULL when done. */
+ while ((res = PQgetResult(conn)) != NULL)
+ PQclear(res);
+ fsstate->s.connspec->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3148,7 +3432,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3158,12 +3442,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3171,9 +3455,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3304,9 +3588,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3314,10 +3598,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4463,8 +4747,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
}
/*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
*/
static void
postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4473,22 +4759,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
slot = ExecForeignScan(node);
- ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * another node on this connection is the owner.
+ */
static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
- bool reinit)
+ bool reinit)
{
- elog(ERROR, "postgresForeignAsyncConfigureWait");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connspec->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
return false;
}
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
static void
postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
- elog(ERROR, "postgresForeignAsyncNotify");
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
/*
@@ -4848,7 +5171,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..3f83b72 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 0dd95c6..1cba31e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -820,6 +821,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 41fc76f..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
ForeignAsyncRequest_function ForeignAsyncRequest;
ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
--
2.9.2
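[ Illustration, not part of the patch: the connection-sharing rule that
request_more_data()/vacate_connection() enforce above is "one in-flight
query per connection; drain the owner's result before anyone else sends".
A minimal standalone libpq sketch of that discipline follows; the conninfo
string is an assumption, and the drain loop mirrors absorb_current_result. ]

#include <stdio.h>
#include <libpq-fe.h>

static void
drain_connection(PGconn *conn)
{
	PGresult   *res;

	/* Absorb every pending result; PQgetResult returns NULL when done. */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* assumed conninfo */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* The "owner" node sends its query without waiting for the result. */
	if (!PQsendQuery(conn, "SELECT pg_sleep(0)"))
	{
		fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/*
	 * Before another node may use the same connection, the owner's result
	 * must be fully absorbed -- this is what vacate_connection() does.
	 */
	drain_connection(conn);

	/* Now the connection is vacant and a new query may be issued. */
	res = PQexec(conn, "SELECT 1");
	printf("second query: %s\n", PQresStatus(PQresultStatus(res)));
	PQclear(res);
	PQfinish(conn);
	return 0;
}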
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 52aa13caddd7c4da68784f9a4cd58dc635062ca9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 03/13] Modify async execution infrastructure.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 68 ++++++++--------
contrib/postgres_fdw/postgres_fdw.c | 5 +-
src/backend/executor/execAsync.c | 105 ++++++++++++++-----------
src/backend/executor/nodeAppend.c | 50 ++++++------
src/backend/executor/nodeForeignscan.c | 4 +-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 24 +++++-
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeForeignscan.h | 2 +-
src/include/foreign/fdwapi.h | 2 +-
src/include/nodes/execnodes.h | 10 ++-
src/include/nodes/plannodes.h | 1 +
14 files changed, 167 insertions(+), 113 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index df22beb..9180afe 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6402,13 +6402,13 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6416,10 +6416,10 @@ select * from bar where f1 in (select f1 from foo) for update;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6439,13 +6439,13 @@ select * from bar where f1 in (select f1 from foo) for update;
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+----------------------------------------------------------------------------------------------
LockRows
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-> Hash Join
- Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Append
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6453,10 +6453,10 @@ select * from bar where f1 in (select f1 from foo) for share;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6477,22 +6477,22 @@ select * from bar where f1 in (select f1 from foo) for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar.f1 = foo2.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar.f1 = foo.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6500,16 +6500,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-> Seq Scan on public.foo
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
- Hash Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+ Hash Cond: (bar2.f1 = foo.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> HashAggregate
- Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
- Group Key: foo2.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Group Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6543,8 +6543,8 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
- Hash Cond: (foo2.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+ Hash Cond: (foo.f1 = bar.f1)
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
@@ -6561,8 +6561,8 @@ where bar.f1 = ss.f1;
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
- Merge Cond: (bar2.f1 = foo2.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+ Merge Cond: (bar2.f1 = foo.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6570,8 +6570,8 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo2.f1)), foo2.f1
- Sort Key: foo2.f1
+ Output: (ROW(foo.f1)), foo.f1
+ Sort Key: foo.f1
-> Append
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f180838..abb256b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
static void postgresForeignAsyncRequest(EState *estate,
PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
PendingAsyncRequest *areq,
bool reinit);
static void postgresForeignAsyncNotify(EState *estate,
@@ -4477,11 +4477,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
-static void
+static bool
postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
elog(ERROR, "postgresForeignAsyncConfigureWait");
+ return false;
}
static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit);
static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
PlanState *requestee)
{
PendingAsyncRequest *areq = NULL;
- int i = estate->es_num_pending_async;
+ int nasync = estate->es_num_pending_async;
/*
* If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* We start with 16 slots, and thereafter double the array size each
* time we run out of slots.
*/
- if (i >= estate->es_max_pending_async)
+ if (nasync >= estate->es_max_pending_async)
{
int newmax;
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
* PendingAsyncRequest if there is one. If not, we must allocate a new
* one.
*/
- if (estate->es_pending_async[i] == NULL)
+ if (estate->es_pending_async[nasync] == NULL)
{
areq = MemoryContextAllocZero(estate->es_query_cxt,
sizeof(PendingAsyncRequest));
- estate->es_pending_async[i] = areq;
+ estate->es_pending_async[nasync] = areq;
}
else
{
- areq = estate->es_pending_async[i];
+ areq = estate->es_pending_async[nasync];
MemSet(areq, 0, sizeof(PendingAsyncRequest));
}
- areq->myindex = estate->es_num_pending_async++;
+ areq->myindex = estate->es_num_pending_async;
/* Initialize the new request. */
areq->requestor = requestor;
areq->request_index = request_index;
areq->requestee = requestee;
- /* Give the requestee a chance to do whatever it wants. */
+ /*
+ * Give the requestee a chance to do whatever it wants.
+ * Request functions mark the request ASYNC_COMPLETE if a result is
+ * immediately available.
+ */
switch (nodeTag(requestee))
{
case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
elog(ERROR, "unrecognized node type: %d",
(int) nodeTag(requestee));
}
+
+ /*
+ * If a result is already available, respond to the requestor immediately.
+ */
+ if (areq->state == ASYNC_COMPLETE)
+ {
+ Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+ ExecAsyncResponse(estate, areq);
+
+ return;
+ }
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
}
/*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
{
PendingAsyncRequest *areq = estate->es_pending_async[i];
- /* Skip it if no callback is pending. */
- if (!areq->callback_pending)
- continue;
-
- /*
- * Mark it as no longer needing a callback. We must do this
- * before dispatching the callback in case the callback resets
- * the flag.
- */
- areq->callback_pending = false;
- estate->es_async_callback_pending--;
-
- /* Perform the actual callback; set request_done if appropraite. */
- if (!areq->request_complete)
+ /* Process the callback if one is pending. */
+ if (areq->state == ASYNC_CALLBACK_PENDING)
+ {
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ estate->es_async_callback_pending--;
ExecAsyncNotify(estate, areq);
- else
+ }
+
+ if (areq->state == ASYNC_COMPLETE)
{
any_node_done = true;
if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
PendingAsyncRequest *head;
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
- if (!tail->callback_pending && tail->request_complete)
+ if (tail->state == ASYNC_COMPLETE)
continue;
head = estate->es_pending_async[hidx];
estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
* means wait forever, 0 means don't wait at all, and >0 means wait for the
* indicated number of milliseconds.
*
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for; the latter can occur when the request completes
+ * while the wait events are being configured.
*/
static bool
ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
int n;
bool reinit = false;
bool process_latch_set = false;
+ bool added = false;
if (estate->es_wait_event_set == NULL)
{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
PendingAsyncRequest *areq = estate->es_pending_async[i];
if (areq->num_fd_events > 0)
- ExecAsyncConfigureWait(estate, areq, reinit);
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
}
+ Assert(added);
+
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
occurred_event, EVENT_BUFFER_SIZE,
WAIT_EVENT_ASYNC_WAIT);
+
if (noccurred == 0)
return false;
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
{
PendingAsyncRequest *areq = w->user_data;
- if (!areq->callback_pending)
- {
- Assert(!areq->request_complete);
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ Assert(areq->state == ASYNC_WAITING);
+
+ areq->state = ASYNC_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
}
}
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
if (areq->wants_process_latch)
{
- Assert(!areq->request_complete);
- areq->callback_pending = true;
+ Assert(areq->state == ASYNC_WAITING);
+ areq->state = ASYNC_CALLBACK_PENDING;
}
}
}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
* The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
* and the number of calls should not exceed areq->num_fd_events (as
* prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but the node driver is
+ * responsible for registering at least one event per requestor.
*/
-static void
+static bool
ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
bool reinit)
{
switch (nodeTag(areq->requestee))
{
case T_ForeignScanState:
- ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
break;
default:
elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
areq->num_fd_events = num_fd_events;
areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNC_WAITING;
if (force_reset && estate->es_wait_event_set != NULL)
{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
* need a callback to remove registered wait events. It's not clear
* that we would come out ahead, so use brute force for now.
*/
+ Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
if (areq->num_fd_events > 0 || areq->wants_process_latch)
ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
/* Save result and mark request as complete. */
areq->result = result;
- areq->request_complete = true;
-
- /* Make sure this request is flagged for a callback. */
- if (!areq->callback_pending)
- {
- areq->callback_pending = true;
- estate->es_async_callback_pending++;
- }
+ areq->state = ASYNC_COMPLETE;
}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e61218a..568fa25 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
*/
while ((i = bms_first_member(node->as_needrequest)) >= 0)
{
- ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
node->as_nasyncpending++;
+
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ /* If this request immediately gives a result, take it. */
+ if (node->as_nasyncresult > 0)
+ return node->as_asyncresult[--node->as_nasyncresult];
}
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
{
long timeout = node->as_syncdone ? -1 : 0;
- for (;;)
+ while (node->as_nasyncpending > 0)
{
- if (node->as_nasyncpending == 0)
- {
- /*
- * If there is no asynchronous activity still pending
- * and the synchronous activity is also complete, we're
- * totally done scanning this node. Otherwise, we're
- * done with the asynchronous stuff but must continue
- * scanning the synchronous children.
- */
- if (node->as_syncdone)
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
- break;
- }
- if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
- {
- /* Timeout reached. */
- break;
- }
- if (node->as_nasyncresult > 0)
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
{
/* Asynchronous subplan returned a tuple! */
--node->as_nasyncresult;
return node->as_asyncresult[node->as_nasyncresult];
}
+
+ /* Timeout reached. Fall through to the sync nodes, if any. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
}
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
/* We shouldn't be called until the request is complete. */
- Assert(areq->request_complete);
+ Assert(areq->state == ASYNC_COMPLETE);
/* Our result slot shouldn't already be occupied. */
Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 61899d1..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -376,7 +376,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
* In async mode, configure for a wait
* ----------------------------------------------------------------
*/
-void
+bool
ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit)
{
@@ -384,7 +384,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
FdwRoutine *fdwroutine = node->fdwroutine;
Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
- fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8cabdf..c62aaf2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -237,6 +237,7 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(appendplans);
COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a894a9d..c2e34a8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -370,6 +370,7 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(appendplans);
WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67439ec..9837eff 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1540,6 +1540,7 @@ _readAppend(void)
READ_NODE_FIELD(appendplans);
READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2140094..0575541 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -966,6 +967,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
List *syncplans = NIL;
ListCell *subpaths;
int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used when
+ * deparsing tlist entries in EXPLAIN (see set_deparse_planstate), so keep
+ * the first child of best_path->subpaths at the head of the subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1005,9 +1015,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
asyncplans = lappend(asyncplans, subplan);
++nasyncplans;
+ if (first)
+ referent_is_sync = false;
}
else
syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1017,7 +1031,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5019,7 +5034,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5030,6 +5045,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
plan->righttree = NULL;
node->appendplans = appendplans;
node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index f355954..76dd07a 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4242,7 +4242,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 5a61306..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
extern void ExecAsyncForeignScanRequest(EState *estate,
PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
PendingAsyncRequest *areq, bool reinit);
extern void ExecAsyncForeignScanNotify(EState *estate,
PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 4c50f1e..41fc76f 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
typedef void (*ForeignAsyncRequest_function) (EState *estate,
PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
PendingAsyncRequest *areq,
bool reinit);
typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 81e997e..5afcd34 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -361,6 +361,13 @@ typedef struct ResultRelInfo
* State for an asynchronous tuple request.
* ----------------
*/
+typedef enum AsyncRequestState
+{
+ ASYNC_IDLE,
+ ASYNC_WAITING,
+ ASYNC_CALLBACK_PENDING,
+ ASYNC_COMPLETE
+} AsyncRequestState;
typedef struct PendingAsyncRequest
{
int myindex; /* Index in es_pending_async. */
@@ -369,8 +376,7 @@ typedef struct PendingAsyncRequest
int request_index; /* Scratch space for requestor. */
int num_fd_events; /* Max number of FD events requestee needs. */
bool wants_process_latch; /* Requestee cares about MyLatch. */
- bool callback_pending; /* Callback is needed. */
- bool request_complete; /* Request complete, result valid. */
+ AsyncRequestState state;
Node *result; /* Result (NULL if no more tuples). */
} PendingAsyncRequest;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f0daada..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -229,6 +229,7 @@ typedef struct Append
Plan plan;
List *appendplans;
int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
--
2.9.2
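[ Not part of the patch: the point of 0003 is to replace the two booleans
callback_pending/request_complete with one explicit state machine. A toy
standalone C sketch of the transitions that ExecAsyncSetRequiredEvents,
ExecAsyncEventWait and ExecAsyncRequestDone perform: ]

#include <stdio.h>

typedef enum AsyncRequestState
{
	ASYNC_IDLE,					/* request issued, nothing decided yet */
	ASYNC_WAITING,				/* wait events registered (SetRequiredEvents) */
	ASYNC_CALLBACK_PENDING,		/* an awaited event fired (EventWait) */
	ASYNC_COMPLETE				/* result valid (RequestDone) */
} AsyncRequestState;

static const char *
state_name(AsyncRequestState s)
{
	switch (s)
	{
		case ASYNC_IDLE:
			return "idle";
		case ASYNC_WAITING:
			return "waiting";
		case ASYNC_CALLBACK_PENDING:
			return "callback-pending";
		case ASYNC_COMPLETE:
			return "complete";
	}
	return "?";
}

int
main(void)
{
	AsyncRequestState state = ASYNC_IDLE;

	/* No immediate result: the requestee registers its wait events. */
	state = ASYNC_WAITING;
	printf("after ExecAsyncSetRequiredEvents: %s\n", state_name(state));

	/* The socket became readable during the event loop. */
	state = ASYNC_CALLBACK_PENDING;
	printf("after ExecAsyncEventWait:         %s\n", state_name(state));

	/* The notify callback produced a tuple. */
	state = ASYNC_COMPLETE;
	printf("after ExecAsyncRequestDone:       %s\n", state_name(state));

	/* An immediately-available result instead goes IDLE -> COMPLETE. */
	return 0;
}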
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 9af6f95965adf04e713235f541919158512ae994 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 02/13] Fix some bugs.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
contrib/postgres_fdw/postgres_fdw.c | 3 +-
src/backend/executor/execAsync.c | 4 +-
src/backend/postmaster/pgstat.c | 3 +
src/include/pgstat.h | 3 +-
5 files changed, 81 insertions(+), 74 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..df22beb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6402,120 +6402,120 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for update;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
explain (verbose, costs off)
select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------------------
LockRows
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-> Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Append
- -> Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(22 rows)
select * from bar where f1 in (select f1 from foo) for share;
f1 | f2
----+----
- 1 | 11
- 2 | 22
3 | 33
4 | 44
+ 1 | 11
+ 2 | 22
(4 rows)
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
----------------------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------------------------------------
Update on public.bar
Update on public.bar
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar.f1 = foo2.f1)
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+ Hash Cond: (bar2.f1 = foo2.f1)
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Hash
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-> HashAggregate
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
- Group Key: foo.f1
+ Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+ Group Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6543,26 +6543,26 @@ where bar.f1 = ss.f1;
Foreign Update on public.bar2
Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
-> Hash Join
- Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
- Hash Cond: (foo.f1 = bar.f1)
+ Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+ Hash Cond: (foo2.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
Output: bar.f1, bar.f2, bar.ctid
-> Merge Join
- Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
- Merge Cond: (bar2.f1 = foo.f1)
+ Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+ Merge Cond: (bar2.f1 = foo2.f1)
-> Sort
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Sort Key: bar2.f1
@@ -6570,19 +6570,19 @@ where bar.f1 = ss.f1;
Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-> Sort
- Output: (ROW(foo.f1)), foo.f1
- Sort Key: foo.f1
+ Output: (ROW(foo2.f1)), foo2.f1
+ Sort Key: foo2.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6749,8 +6749,8 @@ update bar set f2 = f2 + 100 returning *;
update bar set f2 = f2 + 100 returning *;
f1 | f2
----+-----
- 1 | 311
2 | 322
+ 1 | 311
6 | 266
3 | 333
4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 595a47e..f180838 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
#include "commands/explain.h"
#include "commands/vacuum.h"
#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -4472,7 +4473,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
TupleTableSlot *slot;
Assert(IsA(node, ForeignScanState));
- slot = postgresIterateForeignScan(node);
+ slot = ExecForeignScan(node);
ExecAsyncRequestDone(estate, areq, (Node *) slot);
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
#include "miscadmin.h"
+#include "pgstat.h"
#include "storage/latch.h"
static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
/* Wait for at least one event to occur. */
noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
- occurred_event, EVENT_BUFFER_SIZE);
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
if (noccurred == 0)
return false;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..af59f51 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3398,6 +3398,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..7769d3c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -787,7 +787,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 6ae1a77eaa324fe4455840ddbeb734bd12bc4ede Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 01/13] robert's 2nd framework
---
contrib/postgres_fdw/postgres_fdw.c | 49 ++++
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 43 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 462 ++++++++++++++++++++++++++++++++
src/backend/executor/nodeAppend.c | 162 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 ++++
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 45 +++-
src/include/executor/execAsync.h | 29 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 15 ++
src/include/nodes/execnodes.h | 57 +++-
src/include/nodes/plannodes.h | 1 +
17 files changed, 909 insertions(+), 25 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..595a47e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -4440,6 +4455,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = postgresIterateForeignScan(node);
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ elog(ERROR, "postgresForeignAsyncNotify");
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execReplication.o execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest. Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can call ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int i = estate->es_num_pending_async;
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (i >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[i] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[i] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[i];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async++;
+
+ /* Initialize the new request. */
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Check for events, but don't block if there are notifications that
+ * have not been delivered yet.
+ */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ cur_timeout = 0; /* Timeout was reached. */
+ else if (timeout > 0)
+ {
+ instr_time cur_time;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+ if (cur_timeout < 0)
+ cur_timeout = 0;
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ /* Skip it if no callback is pending. */
+ if (!areq->callback_pending)
+ continue;
+
+ /*
+ * Mark it as no longer needing a callback. We must do this
+ * before dispatching the callback in case the callback resets
+ * the flag.
+ */
+ areq->callback_pending = false;
+ estate->es_async_callback_pending--;
+
+ /* Perform the actual callback; set requestor_done if appropriate. */
+ if (!areq->request_complete)
+ ExecAsyncNotify(estate, areq);
+ else
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+ }
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ if (!tail->callback_pending && tail->request_complete)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+ estate->es_wait_event_set =
+ CreateWaitEventSet(estate->es_query_cxt,
+ estate->es_allocated_fd_events + 1);
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0)
+ ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE);
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ if (!areq->callback_pending)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(!areq->request_complete);
+ areq->callback_pending = true;
+ }
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register. force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ {
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+ }
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->request_complete = true;
+
+ /* Make sure this request is flagged for a callback. */
+ if (!areq->callback_pending)
+ {
+ areq->callback_pending = true;
+ estate->es_async_callback_pending++;
+ }
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..e61218a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async subplan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * If there are any nodes that need a new asynchronous request,
+ * make all of them.
+ */
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ node->as_nasyncpending++;
+ }
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
*/
- subnode = node->appendplans[node->as_whichplan];
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ for (;;)
+ {
+ if (node->as_nasyncpending == 0)
+ {
+ /*
+ * If there is no asynchronous activity still pending
+ * and the synchronous activity is also complete, we're
+ * totally done scanning this node. Otherwise, we're
+ * done with the asynchronous stuff but must continue
+ * scanning the synchronous children.
+ */
+ if (node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ break;
+ }
+ if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+ {
+ /* Timeout reached. */
+ break;
+ }
+ if (node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
+ */
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->request_complete);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* Request is no longer pending. */
+ Assert(node->as_nasyncpending > 0);
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately because it might
+ * complete synchronously and re-enter this code.
+ */
+ node->as_needrequest = bms_add_member(node->as_needrequest,
+ areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..61899d1 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 30d733e..a8cabdf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,7 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1560ac3..a894a9d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,7 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index dcfa6ee..67439ec 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,7 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 997bdcf..2140094 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -272,6 +272,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -961,8 +962,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -997,7 +1000,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
}
/*
@@ -1007,7 +1017,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5009,7 +5019,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5019,6 +5029,7 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
return node;
}
@@ -6340,3 +6351,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..5a61306 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..4c50f1e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 42c6c58..81e997e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -356,6 +356,25 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ bool callback_pending; /* Callback is needed. */
+ bool request_complete; /* Request complete, result valid. */
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -434,6 +453,31 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_async_callback_pending is the number of PendingAsyncRequests for
+ * which callback_pending is true.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number that any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async;
+ int es_max_pending_async;
+ int es_async_callback_pending;
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1179,17 +1223,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..f0daada 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,7 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
} Append;
/* ----------------
--
2.9.2
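To make the producer-side contract in the executor README above concrete,
a hypothetical FDW's three callbacks might look like the following sketch.
The my_fdw_* helpers are placeholder names of my own, not functions in any
patch here; only the Exec* and AddWaitEventToSet calls come from the
framework.

    static void
    myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;

        if (my_fdw_has_buffered_row(node))
        {
            /* A row is already buffered: hand it straight back. */
            ExecAsyncRequestDone(estate, areq,
                                 (Node *) my_fdw_next_slot(node));
        }
        else
        {
            /* One socket to watch, no interest in the process latch,
             * and the existing WaitEventSet can be reused. */
            ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
        }
    }

    static void
    myForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
                                bool reinit)
    {
        /* Register the connection's socket, with user_data pointing
         * back at the request, as the execAsync.c comment prescribes. */
        AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
                          my_fdw_socket(areq->requestee), NULL, areq);
    }

    static void
    myForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
    {
        /* The socket is readable: absorb input and deliver a tuple
         * (or NULL at end of scan). */
        ExecAsyncRequestDone(estate, areq,
                             (Node *) my_fdw_fetch_slot(areq->requestee));
    }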
Hello, I totally reorganized the patch set into four patches on the
current master (9e43e87).
At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170222.173945.262776579.horiguchi.kyotaro@lab.ntt.co.jp>
Finally, I couldn't reproduce the crash for the (maybe) same case. I
can guess two reasons for this. One would be a situation where
node->as_nasyncpending differs from estate->es_num_pending_async, but
I couldn't find how that could happen. Another is a situation in
postgresIterateForeignScan where the "next owner" reaches EOF but
another waiter has not. I haven't reproduced that situation, but I
fixed the code for that case. In addition, I found a bug in
ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate
way, discarding its return value.
This turned out to be wrong. The true problem here was (maybe) that
ExecAsyncRequest can complete a tuple immediately, which causes
ExecAsyncRequest to be called multiple times for the same child at
once. (In that case, the node being processed is added back to
node->as_needrequest before ExecAsyncRequest returns.)

Iterating over a copy of node->as_needrequest would fix this, but that
seems fragile, so I instead changed ExecAsyncRequest not to return a
tuple immediately. ExecAsyncEventLoop now skips waiting when there is
no node to wait for, and the tuple that was previously "response"d
from within ExecAsyncRequest is now delivered there.
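Roughly, the requestor side now behaves like this (a condensed sketch
following the nodeAppend.c names above, not the literal patch code):

    /* Queue a request for every child that needs one; ExecAsyncRequest
     * no longer hands a tuple back directly, even if one is ready. */
    while ((i = bms_first_member(node->as_needrequest)) >= 0)
    {
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
        node->as_nasyncpending++;
    }

    /* The event loop skips WaitEventSetWait() when nothing needs
     * waiting, and fires ExecAsyncResponse for tuples that used to be
     * returned from within ExecAsyncRequest itself. */
    if (ExecAsyncEventLoop(estate, &node->ps, timeout) &&
        node->as_nasyncresult > 0)
        return node->as_asyncresult[--node->as_nasyncresult];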
In addition, the current policy of preserving es_wait_event_set
doesn't seem to work with the async-capable postgres_fdw, so the
current code clears it on every entry to ExecAppend (see the sketch
below). This needs more thought.
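That is, at the top of ExecAppend, something like this (my paraphrase
of the stopgap, not a quote from the patch):

    /* Discard any event set left over from the previous cycle; the
     * registered postgres_fdw sockets may no longer be valid. */
    if (estate->es_wait_event_set != NULL)
    {
        FreeWaitEventSet(estate->es_wait_event_set);
        estate->es_wait_event_set = NULL;
    }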
I measured the performance of async execution, and it has improved
considerably over the previous version, especially in the
single-connection environment.

pf0: 4 foreign tables on a single connection
  non-async : (prev) 7928ms -> (this time) 7993ms
  async     : (prev) 6447ms -> (this time) 3211ms
pf1: 4 foreign tables, one dedicated connection per table
  non-async : (prev) 8023ms -> (this time) 7953ms
  async     : (prev) 1876ms -> (this time) 1841ms

Async execution cuts runtime by 60% in the single-connection
environment and by 77% in the dedicated-connection environment.
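(That is, (7993 - 3211) / 7993 ≈ 60% and (7953 - 1841) / 7953 ≈ 77%
less time, or roughly 2.5x and 4.3x speedups.)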
Mmm, I reproduced it quite easily. A silly bug. Something bad was
happening between freeing the ExecutorState memory context and the
resource owner: the ExecutorState was freed by the resowner (as part
of its ancestors) before the memory for the WaitEventSet was freed. It
was careless of me.

The cause was that the WaitEventSet was placed in the ExecutorState
but registered to TopTransactionResourceOwner. I fixed it.
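In other words, the two lifetimes are now kept aligned; the call site
presumably becomes something like this (a sketch using the new
CreateWaitEventSet signature from 0001, not a quote from the patch):

    /* Allocate the set in per-query memory and register it with the
     * per-query resource owner, rather than leaving it registered to
     * TopTransactionResourceOwner while its memory lives in the
     * shorter-lived ExecutorState context. */
    estate->es_wait_event_set =
        CreateWaitEventSet(estate->es_query_cxt, CurrentResourceOwner,
                           estate->es_allocated_fd_events + 1);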
The attached patches are as follows:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
  Allows a WaitEventSet to be released by a resource owner.

0002-Asynchronous-execution-framework.patch
  The asynchronous execution framework based on Robert's version,
  with all of my edits merged in.

0003-Make-postgres_fdw-async-capable.patch
  Makes postgres_fdw async-capable.

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
  This could be merged into 0002, but I kept it separate since the
  use of these pragmas is arguable (sketched just below).
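For 0004, the idea (inferred from its title and sketched on top of the
ExecAppend hunk from the framework patch; 0004 itself is not quoted
here) is just to mark the async branches as unlikely, so that a plan
with no async children keeps the old straight-line synchronous path:

    if (unlikely(node->as_nasyncplans > 0))
    {
        EState *estate = node->ps.state;
        int     i;

        while ((i = bms_first_member(node->as_needrequest)) >= 0)
        {
            ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
            node->as_nasyncpending++;
        }
    }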
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From bcd888a98a7aa5e1bd367c83e06d598121fd2d94 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner
A WaitEventSet sometimes needs to be released via a resource owner.
This change teaches resource owners to track WaitEventSets and allows
the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7939b1f..16a5d7a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 0079ba5..a204b0c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -324,7 +327,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -482,12 +485,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +553,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -582,6 +593,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From 2f90dd114467c5da10b8e3bdaa20ccef47052a15 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from
underlying nodes asynchronously. This is a different mechanism from
parallel execution: while parallel execution is analogous to threads,
this framework is analogous to select(2), handling multiple inputs in
a single backend process. To avoid degrading non-async execution, the
framework conveys tuples through a completely separate channel. The
details of the API are described at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 9 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 63 +++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 979 insertions(+), 29 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execReplication.o execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..7bd009c 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+	/* Add the request to the pending list; the event loop will deliver it */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for events only if any node is async-not-ready. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+ /* Don't block if any tuple available. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ { /* Not fired */
+ /* Exited before timeout. Calculate the remaining time. */
+ instr_time cur_time;
+
+ /* Wait forever */
+ if (timeout < 0)
+ continue;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout =
+ timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+ if (cur_timeout > 0)
+ continue;
+
+ /* Timeout has expired; clamp so the exit test below triggers */
+ cur_timeout = 0;
+ }
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+ /* Deliver the acquired tuple to the requester */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; on the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out; returns true if any event fired or if
+ * there was no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+ * The wait event set created here must outlive the ExecutorState
+ * context, but must still be released on error.
+ */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+ * We may have no events to wait for. This occurs when all nodes
+ * executing asynchronously have tuples immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ return false; /* keep compiler quiet */
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5ccc2e8..88f823d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
@@ -822,6 +823,14 @@ ExecShutdownNode(PlanState *node)
case T_GatherState:
ExecShutdownGather((GatherState *) node);
break;
+ case T_ForeignScanState:
+ {
+ ForeignScanState *fsstate = (ForeignScanState *)node;
+ FdwRoutine *fdwroutine = fsstate->fdwroutine;
+ if (fdwroutine->ShutdownForeignScan)
+ fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+ }
+ break;
default:
break;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..12d3742 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async plan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * XXXX: Always clear registered events. This seems a bit inefficient,
+ * but the set of events to wait for changes almost at random on
+ * every call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached; fall through to the sync nodes, if any */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +306,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +359,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +387,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately because it might
+ * complete while the event loop is still delivering callbacks.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
}
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 05d8538..e64ec77 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,8 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b3802b4..8b39efa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,8 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d2f69fe..d5d3c81 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,8 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1e953b4..72080cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -272,6 +273,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
/*
@@ -960,8 +962,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -987,7 +993,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate), so the first
+	 * child in best_path->subpaths must be kept at the head of the
+	 * subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -996,7 +1009,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1006,7 +1030,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5003,7 +5028,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5013,6 +5038,8 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6334,3 +6361,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ada374c..a0ec3b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3401,6 +3401,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index b27b77d..c43e8b2 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4240,7 +4240,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -224,6 +234,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+ ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6332ea0..8445d79 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -356,6 +356,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -435,6 +461,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests whose
+ * results are ready to be retrieved.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of nodes being waited for */
+ int es_max_pending_async; /* max # of pending nodes */
+ int es_async_callback_pending; /* # of callbacks to deliver */
+ int es_num_async_ready; /* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1180,17 +1232,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,8 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8b710ec..6c94a75 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -788,7 +788,8 @@ typedef enum
WAIT_EVENT_MQ_SEND,
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From bd740f884446b60847c579a6a4c16c7b2d16cf90 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the new infrastructure.
Additionally, give each postgres_fdw connection a connection-specific
storage area so that foreign scans on the same connection can share
data; postgres_fdw uses it to track the scan node currently running a
query on the underlying connection. This allows asynchronous execution
of multiple foreign scans on one foreign server.
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 12 +-
5 files changed, 583 insertions(+), 152 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. If it does not
+ * exist yet, allocate initsize bytes.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..90691e5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6401,34 +6401,39 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6438,34 +6443,39 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6494,11 +6504,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Hash Cond: (bar2.f1 = foo.f1)
@@ -6511,11 +6521,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6546,16 +6556,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6573,16 +6583,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6733,27 +6743,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..76e8437 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -33,6 +35,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -52,6 +55,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -121,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -135,7 +158,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready; /* true if a tuple is ready to be returned */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -151,6 +174,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -164,11 +194,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -191,6 +221,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -289,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -349,6 +381,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -369,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -434,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -468,6 +512,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1319,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1380,32 +1439,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let
+ * the first waiter be the next owner of the connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and we want this node
+ * to run later. Add myself to the tail of the waiters' list, then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner maintains a shortcut to the last
+ * waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1421,7 +1578,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1429,6 +1586,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1457,9 +1617,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1477,7 +1637,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1485,16 +1645,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove async state and clean up leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1696,7 +1872,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1775,6 +1953,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1785,14 +1965,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1800,10 +1980,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1841,6 +2021,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1861,14 +2043,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1876,10 +2058,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1917,6 +2099,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1937,14 +2121,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1952,10 +2136,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2002,16 +2186,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2291,7 +2475,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2344,7 +2530,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2391,8 +2580,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2511,6 +2700,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2554,6 +2744,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2908,11 +3108,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2978,47 +3178,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by an already-sent FETCH on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3028,27 +3277,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3132,7 +3436,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3142,12 +3446,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3155,9 +3459,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3288,9 +3592,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3298,10 +3602,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4440,6 +4744,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4797,7 +5175,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..4dca0c4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1511,12 +1511,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
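A note for anyone trying to follow the connection-sharing logic in the patch above: stripped of the executor machinery, the current_owner/waiter/last_waiter protocol reduces to roughly the self-contained sketch below. This is an editorial illustration distilled from the patch, not code taken from it; the struct layout is simplified.

#include <assert.h>
#include <stddef.h>

typedef struct ScanNode ScanNode;

typedef struct ConnPriv
{
	ScanNode   *current_owner;	/* node with a query in flight, or NULL */
} ConnPriv;

struct ScanNode
{
	ConnPriv   *connpriv;		/* per-connection shared state */
	ScanNode   *waiter;			/* next node waiting for this connection */
	ScanNode   *last_waiter;	/* tail shortcut; valid on the owner only,
								 * and points at the node itself when no
								 * one is waiting (as at BeginForeignScan) */
};

/* Called when "node" wants the connection but someone else owns it. */
static void
enqueue_waiter(ScanNode *node)
{
	ScanNode   *owner = node->connpriv->current_owner;

	assert(owner != NULL && owner != node);
	/* O(1) append using the owner's tail shortcut */
	owner->last_waiter->waiter = node;
	owner->last_waiter = node;
}

/* Called once the owner has absorbed its result; returns the next owner. */
static ScanNode *
hand_over_connection(ScanNode *owner)
{
	ScanNode   *next = owner->waiter;

	owner->connpriv->current_owner = NULL;
	if (next != NULL)
	{
		owner->waiter = NULL;
		/* only the owner maintains the tail shortcut; pass it along */
		next->last_waiter = owner->last_waiter;
		owner->last_waiter = owner;
	}
	return next;
}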
0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch)
From b50c350b8392b6c7621cb93c863470a07f5bb563 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to get slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing route. Asynchronous execution already
involves a lot of additional code, so this doesn't add significant
degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 12d3742..f44c40a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -249,7 +249,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
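For anyone reviewing 0004 who hasn't met these macros: likely()/unlikely() in PostgreSQL's c.h wrap GCC's __builtin_expect, so the compiler keeps the hinted-likely path as the straight-line fall-through and moves the unlikely branch out of the hot path. A minimal sketch of the idea follows (c.h is the authoritative source; the exact guards there may differ):

/* Branch-prediction hints, roughly as PostgreSQL defines them. */
#if defined(__GNUC__)
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif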
On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
9e43e87
Patch fails on current master, but correctly applies to 9e43e87. Thanks for
including the commit id.
Regression tests pass.
As with my last attempt at reviewing this patch, I'm confused about what
kind of queries can take advantage of this patch. Is it only cases where a
local table has multiple inherited foreign table children? Will it work
with queries where two foreign tables are referenced and combined with a
UNION ALL?
On 2017/03/11 8:19, Corey Huinker wrote:
On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
9e43e87
Patch fails on current master, but correctly applies to 9e43e87. Thanks
for including the commit id.
Regression tests pass.
As with my last attempt at reviewing this patch, I'm confused about what
kind of queries can take advantage of this patch. Is it only cases where a
local table has multiple inherited foreign table children?
IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScans
driven by postgres_fdw (after building some relevant infrastructure
first). The same might be added to other Scan nodes (and probably other
nodes as well) eventually so that more queries will benefit from
asynchronous execution. It may just be that ForeignScans benefit more
from asynchronous execution than other Scan types.
Will it work
with queries where two foreign tables are referenced and combined with a
UNION ALL?
I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.
Thanks,
Amit
I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.
Ok, I'll re-run my test from a few weeks back and see if anything has
changed.
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
wrote:
I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.
Ok, I'll re-run my test from a few weeks back and see if anything has
changed.
I'm not able to discern any difference in plan between a 9.6 instance and
this patch.
The basic outline of my test is:
EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
I've tried this test where tab1 through tab4 all are the same postgres_fdw
foreign table.
I've tried this test where tab1 through tab4 all are different foreign
tables pointing to the same remote table, sharing the same server
definition.
I've tried this test where tab1 through tab4 all are different foreign
tables, each with its own foreign server definition, all of which
happen to point to the same remote cluster.
Are there some postgresql.conf settings I should set to get a decent test?
On 2017/03/14 6:31, Corey Huinker wrote:
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
wrote:
I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.
Ok, I'll re-run my test from a few weeks back and see if anything has
changed.
I'm not able to discern any difference in plan between a 9.6 instance and
this patch.The basic outline of my test is:
EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
I've tried this test where tab1 through tab4 all are the same postgres_fdw
foreign table.
I've tried this test where tab1 through tab4 all are different foreign
tables pointing to the same remote table, sharing the same server
definition.
I've tried this test where tab1 through tab4 all are different foreign
tables, each with its own foreign server definition, all of which
happen to point to the same remote cluster.
Are there some postgresql.conf settings I should set to get a decent test?
I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.
Thanks,
Amit
I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.
Thanks,
Amit
I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
instance given the same script. I must be missing something.
On 2017/03/14 10:08, Corey Huinker wrote:
I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.
I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
instance given the same script. I must be missing something.
Hmm, maybe I'm missing something too.
Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:
/messages/by-id/20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp
From that message:
<quote>
I measured performance and had the following result.
t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicated connections>;
The result is written as "time<ms> (std dev <ms>)"
sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)
async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>
IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.
Thanks,
Amit
On Mon, Mar 13, 2017 at 9:28 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/03/14 10:08, Corey Huinker wrote:
I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.
I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
instance given the same script. I must be missing something.
Hmm, maybe I'm missing something too.
Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:
/messages/by-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp
From that message:
<quote>
I measured performance and had the following result.
t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicated connections>;
The result is written as "time<ms> (std dev <ms>)"
sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)
async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>IIUC, pf0 and pf1 is the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.Thanks,
Amit
I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.
I don't know how to proceed with this review if that was a goal of the
patch.
Corey Huinker <corey.huinker@gmail.com> writes:
I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.
I don't know how to proceed with this review if that was a goal of the
patch.
Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.
regards, tom lane
On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.
I don't know how to proceed with this review if that was a goal of the
patch.
Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.
regards, tom lane
Updated commitfest entry to "Returned With Feedback".
At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=cBZEX9L9HnhJYrtfiAN5Ebdu=xbvM_poWVGBR7yN3gVw@mail.gmail.com>
On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.
I don't know how to proceed with this review if that was a goal of the
patch.
Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.
regards, tom lane
Updated commitfest entry to "Returned With Feedback".
Sorry for the absence. For information, I'll add some more detail below.
At Tue, 14 Mar 2017 10:28:36 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <e7dc8128-f32b-ff9a-870e-f1117b8e4fa6@lab.ntt.co.jp>
async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>
IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.
pf0 is measured on partitioned (sharded) tables on one foreign
server, that is, sharing a connection. pf1, in contrast, is sharded
tables that each have a dedicated server (or connection). The parent
server is async-patched and the child server is not patched.
An async-capable plan is generated in the planner. An Append that
contains at least one async-capable child becomes an async-aware Append,
so the async feature should also be effective for the UNION ALL case.
The following should run faster than on an unpatched version:
SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;
I'll measure the performance for the case next week.
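As an aside, the select(2) analogy behind this framework can be made concrete: an async-aware Append behaves roughly like the standalone poll(2) loop below, consuming results from whichever child answers first instead of draining the children strictly in order. This is only an illustration of the idea, not executor code; the "children" here are plain pipes.

#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NCHILDREN 4

int
main(void)
{
	struct pollfd pfds[NCHILDREN];
	int			remaining = NCHILDREN;

	/* Spawn children that each "return a tuple" after a different delay. */
	for (int i = 0; i < NCHILDREN; i++)
	{
		int			fds[2];

		if (pipe(fds) != 0)
			return 1;
		if (fork() == 0)
		{
			char		buf[32];

			close(fds[0]);
			sleep(NCHILDREN - i);	/* later children answer sooner */
			snprintf(buf, sizeof(buf), "tuple from child %d\n", i);
			(void) write(fds[1], buf, strlen(buf));
			_exit(0);
		}
		close(fds[1]);
		pfds[i].fd = fds[0];
		pfds[i].events = POLLIN;
	}

	/* Consume results in arrival order, like an async-aware Append. */
	while (remaining > 0)
	{
		if (poll(pfds, NCHILDREN, -1) < 0)
			return 1;
		for (int i = 0; i < NCHILDREN; i++)
		{
			if (pfds[i].revents & POLLIN)
			{
				char		buf[64];
				ssize_t		n = read(pfds[i].fd, buf, sizeof(buf) - 1);

				if (n > 0)
				{
					buf[n] = '\0';
					fputs(buf, stdout);
				}
				pfds[i].fd = -1;	/* done with this child */
				remaining--;
			}
		}
	}
	return 0;
}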
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0002-Asynchronous-execution-framework.patch (text/x-patch)
From f049f01a92e91f4185f12f814dd90bb16d390121 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. Any executor node can receive tuples from underlying nodes
asynchronously with this. This is a different mechanism from parallel
execution. While parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely separate channel to convey tuples.
The details of the API are described at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 4 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 64 +++-
src/backend/postmaster/pgstat.c | 3 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 971 insertions(+), 30 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index d281906..d6c74bd 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
- execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+ execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
execReplication.o execScan.o execTuples.o \
execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..7bd009c 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.
+This might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
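/*
 * (Editorial illustration, not part of the patch.) A sketch of the
 * requestor-side protocol the README above describes. The
 * ExecAsyncRequest/ExecAsyncEventLoop signatures follow execAsync.c
 * below; MyParentState and its fields are invented for the example,
 * and a real node would also implement an ExecAsyncResponse callback
 * that stores the delivered slot and decrements npending.
 */
typedef struct MyParentState
{
	PlanState	ps;				/* first field must be the PlanState */
	int			npending;		/* async requests not yet completed */
	TupleTableSlot *result;		/* filled in by our response callback */
} MyParentState;

static TupleTableSlot *
MyParentNext(MyParentState *node, PlanState **children, int nchildren)
{
	EState	   *estate = node->ps.state;
	int			i;

	/* 1. Ask each async-capable child for a tuple. */
	for (i = 0; i < nchildren; i++)
		ExecAsyncRequest(estate, &node->ps, i, children[i]);
	node->npending = nchildren;

	/* 2. Run the event loop; a timeout of -1 blocks until a result arrives. */
	while (node->result == NULL && node->npending > 0)
		(void) ExecAsyncEventLoop(estate, &node->ps, -1);

	/* 3. node->result was delivered via our ExecAsyncResponse hook. */
	return node->result;
}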
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 5d59f95..ecc8eec 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -473,11 +473,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+ /* Add to the pending list; the event loop delivers any ready result. */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for events only if any node is async-not-ready. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+ /* Don't block if any callbacks are already pending. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ { /* Not fired */
+ /* Woke up before the timeout expired; update remaining time. */
+ instr_time cur_time;
+
+ /* Wait forever */
+ if (timeout < 0)
+ continue;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout =
+ timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+ if (cur_timeout > 0)
+ continue;
+
+ /* Timed out; let the exit test at the loop bottom see it. */
+ cur_timeout = 0;
+ }
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+ /* Deliver the acquired tuple to the requestor */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending.  Otherwise, each call to this function might advance
+ * the computation by only a very small amount; instead, we want
+ * to push it forward as far as possible before returning.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if at least one event fired
+ * or there was no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+ * The wait event set created here must outlive the ExecutorState
+ * context, but must still be released in case of error.
+ */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+ * We may have no events to wait for.  This occurs when every node
+ * that is executing asynchronously has a tuple immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but the node
+ * driver is responsible for registering at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+ return false; /* keep compiler quiet */
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 80c77ad..31222ea 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -117,6 +117,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..12d3742 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async subplan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -193,15 +208,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+ * XXX: Always clear registered events.  This seems a bit inefficient,
+ * but the set of events to wait for changes almost at random from
+ * call to call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached.  Fall through to the sync nodes, if any. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -221,14 +306,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -267,6 +359,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -285,6 +387,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request.  We
+ * don't launch another request here immediately, because it might
+ * complete synchronously and re-enter this response path.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 3b6d139..0a46f5f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -369,3 +369,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 25fd051..7b548c0 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,8 @@ _copyAppend(const Append *from)
* copy remainder of node
*/
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7418fbe..688d197 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,8 @@ _outAppend(StringInfo str, const Append *node)
_outPlanInfo(str, (const Plan *) node);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d3bbc02..7cb9d2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1565,6 +1565,8 @@ _readAppend(void)
ReadCommonPlan(&local_node->plan);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 89e1946..14b46ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -199,7 +199,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
Index scanrelid, int ctePlanId, int cteParam);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -279,7 +280,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -980,8 +981,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1007,7 +1012,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used
+ * when deparsing tlist entries (see set_deparse_planstate), so we
+ * track where the first child of best_path->subpaths ends up in the
+ * reordered subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1016,7 +1028,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1026,7 +1049,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5161,7 +5185,7 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5171,6 +5195,8 @@ make_append(List *appendplans, List *tlist)
plan->lefttree = NULL;
plan->righttree = NULL;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6492,3 +6518,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7cacb1e..1a47c2a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3404,6 +3404,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5c82325..779ffb5 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4264,7 +4264,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f856f60..0308afc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -358,6 +358,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Events fired; callback not yet run */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -437,6 +463,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests whose result
+ * is ready to be retrieved.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of pending async requests */
+ int es_max_pending_async; /* allocated size of es_pending_async */
+ int es_async_callback_pending; /* # of requests with pending callbacks */
+ int es_num_async_ready; /* # of requests with results ready */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -1182,17 +1234,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b880dc1..0d4f285 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,8 @@ typedef struct Append
{
Plan plan;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 60c78d1..3265a48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -789,7 +789,8 @@ typedef enum
WAIT_EVENT_PARALLEL_FINISH,
WAIT_EVENT_PARALLEL_BITMAP_SCAN,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patchtext/x-patch; charset=us-asciiDownload
From 7cf7a75a323634c3f89bb38167bd2a83b2fa8d13 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by branch-misprediction penalties
on the async-execution checks. Apply unlikely() to those branches to
avoid that penalty on the existing synchronous route. Asynchronous
execution already carries a lot of additional code, so this doesn't
add significant degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 12d3742..f44c40a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -249,7 +249,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
0003-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From 7c1fca8aae368466300e6c48f650f3ba1d310577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the new infrastructure.
Additionally, give each postgres_fdw connection a connection-specific
storage area so that foreign scans on the same connection can share
some data; postgres_fdw uses it to track the scan node currently
running a query on the connection. This allows asynchronous execution
of multiple foreign scans on one foreign server.
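
Roughly, the connection-private area is meant to be used like this; a
sketch only, not verbatim patch code (the real call sites appear
further down in postgres_fdw.c):

    /* Claim the shared connection before sending a new query. */
    PgFdwConnpriv *connpriv = GetPgFdwScanState(node)->s.connpriv;

    if (connpriv->current_owner != NULL &&
        connpriv->current_owner != node)
    {
        /* Another scan owns the connection; make it finish first. */
        vacate_connection(&GetPgFdwScanState(connpriv->current_owner)->s);
    }
    connpriv->current_owner = node;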
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 12 +-
5 files changed, 583 insertions(+), 152 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user mapping,
+ * allocating and zeroing initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..90691e5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6401,34 +6401,39 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6438,34 +6443,39 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6494,11 +6504,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Hash Cond: (bar2.f1 = foo.f1)
@@ -6511,11 +6521,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6546,16 +6556,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6573,16 +6583,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6733,27 +6743,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 990313a..093fa1a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,10 +128,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -136,7 +159,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -152,6 +175,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -165,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -192,6 +222,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -290,6 +321,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -350,6 +382,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -370,7 +410,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -435,6 +478,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -469,6 +513,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1320,12 +1370,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1381,32 +1440,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+			 * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+			 * Someone else is holding this connection and we want this node to
+ * run later. Add myself to the tail of the waiters' list then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner is to maintain the shortcut to the last
+ * waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1422,7 +1579,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1430,6 +1587,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+	/* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1458,9 +1618,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1478,7 +1638,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1486,16 +1646,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+	/* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1697,7 +1873,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1776,6 +1954,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1786,14 +1966,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1801,10 +1981,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1842,6 +2022,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1862,14 +2044,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1877,10 +2059,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1918,6 +2100,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1938,14 +2122,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1953,10 +2137,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2003,16 +2187,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2292,7 +2476,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2345,7 +2531,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2392,8 +2581,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2512,6 +2701,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2555,6 +2745,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2909,11 +3109,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2979,47 +3179,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Read the rows that have arrived from the node's cursor and store them.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+	 * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3029,27 +3278,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+	 * Let the current connection owner read the result for the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3133,7 +3437,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3143,12 +3447,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3156,9 +3460,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3289,9 +3593,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3299,10 +3603,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4445,6 +4749,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is
+ * immediately available. ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4802,7 +5180,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..4dca0c4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1511,12 +1511,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 5b685ff78f11ee08c385d7a6c793f4d7cfc164e3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner
A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource-owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 0fad806..1efdeb4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index ea7f930..7a8059f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -61,6 +61,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -89,6 +90,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -323,7 +326,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -481,12 +484,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -546,6 +552,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -581,6 +592,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
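(To illustrate what this patch enables: a minimal usage sketch, assuming
the patched three-argument CreateWaitEventSet(). The surrounding code is
invented for illustration only, and on branches with wait-event reporting
WaitEventSetWait() takes an additional wait_event_info argument.)

    /*
     * Create a set owned by the current resource owner, so that it is
     * freed automatically (with a leak warning at commit) if an error
     * escapes before FreeWaitEventSet() is reached.
     */
    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext,
                                           CurrentResourceOwner, 1);
    WaitEvent     occurred;

    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    (void) WaitEventSetWait(set, -1 /* no timeout */, &occurred, 1);
    ResetLatch(MyLatch);

    /* FreeWaitEventSet() also forgets the set from its resource owner. */
    FreeWaitEventSet(set);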
Hello. This is the final report in this CF period.
At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.173505.152063931.horiguchi.kyotaro@lab.ntt.co.jp>
An async-capable plan is generated in the planner. An Append that
contains at least one async-capable child becomes an async-aware Append,
so the async feature should also be effective for the UNION ALL case.
The following will run faster than on the unpatched version:
SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;
I'll measure the performance for the case next week.
I found that the following query works the same as the partitioned-table
case.
SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40 UNION ALL *SELECT a FROM ONLY pf0*) as ft;
So the difference comes from the additional async-incapable child (the
query is faster if it contains one). In both cases, the Append node runs
its children asynchronously, but behaves slightly differently when all
async-capable children are busy.
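(For readers skimming the thread, a rough requestor-side sketch of the
framework follows. ExecAsyncRequest and ExecAsyncEventLoop are the patch's
entry points, with the argument order inferred from its comments; the
surrounding variables are illustrative, not the patch's actual
nodeAppend.c code.)

    /*
     * Illustrative only: an async-aware Append first asks each
     * async-capable child for a tuple; results are delivered through the
     * requestor's ExecAsyncResponse callback, not through this loop.
     */
    for (i = 0; i < nasyncplans; i++)
        ExecAsyncRequest(estate, &appendstate->ps, i, asyncplans[i]);

    /*
     * Until some child responds, run the async event loop.  Per the
     * patch's README, a timeout of -1 polls without blocking, so a
     * caller with nothing else to do passes a longer timeout.
     */
    while (!done && nresults == 0)
        ExecAsyncEventLoop(estate, &appendstate->ps, timeout);

The producer side of the same interface is what the postgres_fdw hunks
above implement in postgresForeignAsyncRequest, -ConfigureWait and -Notify.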
I'll continue working on this from this point, aiming at the next
commit fest.
Thank you for valuable feedback.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
I'll continue working on this from this point, aiming at the next
commit fest.
This probably will not surprise you given the many commits in the past 2
weeks, but the patches no longer apply to master:
$ git apply
~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
trailing whitespace.
FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39:
trailing whitespace.
#include "utils/resowner_private.h"
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47:
trailing whitespace.
ResourceOwner resowner; /* Resource owner */
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48:
trailing whitespace.
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57:
trailing whitespace.
WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL,
3);
error: patch failed: src/backend/libpq/pqcomm.c:201
error: src/backend/libpq/pqcomm.c: patch does not apply
error: patch failed: src/backend/storage/ipc/latch.c:61
error: src/backend/storage/ipc/latch.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/condition_variable.c:66
error: src/backend/storage/lmgr/condition_variable.c: patch does not apply
error: patch failed: src/backend/utils/resowner/resowner.c:124
error: src/backend/utils/resowner/resowner.c: patch does not apply
error: patch failed: src/include/storage/latch.h:101
error: src/include/storage/latch.h: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:18
error: src/include/utils/resowner_private.h: patch does not apply
Hello,
At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=dN_vt8kazOoiVOfjN6xFHpzf5uiGJz+iN+f4fLbYwSKA@mail.gmail.com>
I'll continue working on this from this point, aiming at the next
commit fest.
This probably will not surprise you given the many commits in the past 2
weeks, but the patches no longer apply to master:
Yeah, that doesn't surprise me, but thank you for letting me know. It
greatly reduces the difficulty of merging. Thank you.
$ git apply
~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
trailing whitespace.
Maybe the patch was retrieved on Windows and then transferred to the
Linux box. Converting the files' EOLs, or some git configuration, might
fix that. (git am has --no-keep-cr, but I haven't found an equivalent
for git apply.)
The attached patch is rebased onto the current master, with no
substantial changes other than disallowing partitioned tables for
async execution by an assertion.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From e4c38a11171e8c6c6a1950f122b97b5048c7c5f8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner
A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource-owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 0fad806..1efdeb4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4798370..a3372bd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -61,6 +61,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -89,6 +90,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -323,7 +326,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -481,12 +484,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -546,6 +552,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -581,6 +592,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From 505fb96f7ca0a3cc729311e68dbd010fdb098c27 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely different channel to convey tuples.
The details of the API are described at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 2 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 69 ++++-
src/backend/postmaster/pgstat.c | 2 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 974 insertions(+), 29 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
execGrouping.o execIndexing.o execJunk.o \
execMain.o execParallel.o execProcnode.o \
execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Then, if the
+result is not available immediately, it must run the asynchronous
+event loop using ExecAsyncEventLoop. It can avoid giving up control
+indefinitely by passing a timeout to this function; in particular, a
+timeout of 0 polls for events without blocking. Eventually, when a
+node to which an asynchronous request has been made produces a tuple,
+the requesting node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
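[ For illustration only: a minimal, hedged sketch of the requestor-side
protocol described in the README hunk above. The node type MyState, its
child array, and the result bookkeeping are hypothetical stand-ins for
whatever a requesting node actually maintains; only ExecAsyncRequest,
ExecAsyncEventLoop and the ExecAsyncResponse callback come from this
patch. Error and end-of-stream handling are omitted. ]

    /* Ask each async-capable child for a tuple, then run the event loop
     * until our ExecAsyncResponse callback has stashed a result. */
    static TupleTableSlot *
    my_fetch_async(MyState *node)
    {
        EState *estate = node->ps.state;
        int     i;

        /* issue requests (a real node would track which children
         * still need one instead of asking all of them every time) */
        for (i = 0; i < node->nchildren; i++)
            ExecAsyncRequest(estate, &node->ps, i, node->children[i]);

        /* timeout -1: block until some child delivers a result */
        while (node->nready == 0)
            ExecAsyncEventLoop(estate, &node->ps, -1);

        /* results[] is filled in by our ExecAsyncResponse callback */
        return node->results[--node->nready];
    }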
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7e85c66..ddb6d64 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -478,11 +478,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if some pending request is not yet complete. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+			/* Just poll, don't block, if callbacks are already pending. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{
+				/*
+				 * No relevant event fired. We may have exited before the
+				 * timeout, so calculate the remaining time.
+				 */
+				instr_time	cur_time;
+
+				/* Wait forever */
+				if (timeout < 0)
+					continue;
+
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+				if (cur_timeout > 0)
+					continue;
+
+				/* Timeout expired; make sure we exit below. */
+				cur_timeout = 0;
+			}
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requestor */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+			 * Move all not-yet-completed items to the start of the array,
+			 * keeping them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead, we want
+		 * to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, and true if any event fired or there is
+ * no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+		 * The wait event set created here must survive beyond the
+		 * ExecutorState context, but must still be released in case of
+		 * error.
+		 */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+	 * We may have no events to wait for. This occurs when every node
+	 * that is executing asynchronously has a tuple immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				 (int) nodeTag(areq->requestee));
+	}
+	return false;				/* keep compiler quiet */
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
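[ For illustration only: the producer half of the protocol, sketched for a
hypothetical FDW. MyScanState, its query_sent flag, and fetch_one_tuple()
are assumptions, not part of this patch; ExecAsyncSetRequiredEvents and
ExecAsyncRequestDone are the calls defined above. ]

    static void
    myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        MyScanState *fsstate = (MyScanState *) node->fdw_state;

        if (!fsstate->query_sent)
        {
            /* dispatch the remote query without blocking
             * (error handling omitted in this sketch) */
            PQsendQuery(fsstate->conn, fsstate->query);
            fsstate->query_sent = true;
        }
        /* wait on exactly one socket; no latch; keep the event set */
        ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
    }

    static void
    myForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        MyScanState *fsstate = (MyScanState *) node->fdw_state;

        PQconsumeInput(fsstate->conn);
        if (PQisBusy(fsstate->conn))
        {
            /* not ready yet; go back to waiting on the socket */
            ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
            return;
        }
        /* a result has arrived: convert it and complete the request */
        ExecAsyncRequestDone(estate, areq,
                             (Node *) fetch_one_tuple(fsstate, node));
    }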
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 486ddf1..2f896ef 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a107545..d91e621 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every async subplan needs a new request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /*
+		 * XXX: Always clear the registered events. This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * at random from one call to the next.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+				/* Timeout reached. Fall through to the sync nodes, if any. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+	 * Mark the node that returned a result as ready for a new request. We
+	 * don't launch another request here immediately because it might
+	 * complete while we are still in the middle of handling this response.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
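[ For illustration only: the response half on the requestor side, in its
simplest form, matching the hypothetical MyState sketch after the README
hunk above; ExecAsyncAppendResponse just above is the real, fuller
version. ]

    static void
    my_async_response(EState *estate, PendingAsyncRequest *areq)
    {
        MyState *node = (MyState *) areq->requestor;
        TupleTableSlot *slot = (TupleTableSlot *) areq->result;

        /* a NULL or empty slot means this child has no more tuples */
        if (!TupIsNull(slot))
            node->results[node->nready++] = slot;
    }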
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
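[ For illustration only: the matching ConfigureWait callback for the FDW
sketch after execAsync.c above, registering the libpq socket with the
executor's WaitEventSet. MyScanState is the same hypothetical stand-in. ]

    static bool
    myForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
                                bool reinit)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        MyScanState *fsstate = (MyScanState *) node->fdw_state;

        /* one event, as promised via ExecAsyncSetRequiredEvents */
        AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
                          PQsocket(fsstate->conn), NULL, areq);
        return true;
    }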
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 61bc502..9856dfb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -239,6 +239,8 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(partitioned_rels);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 83fb39f..f324b0c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(partitioned_rels);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 766f2d8..8c57d81 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1575,6 +1575,8 @@ _readAppend(void)
READ_NODE_FIELD(partitioned_rels);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index b121f40..c6825d2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
Index scanrelid, char *enrname);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -283,7 +284,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -992,8 +993,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1019,7 +1024,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate). Because async
+	 * subplans are moved to the head of the subplan list, we record, as the
+	 * "referent", the index at which that first child ends up.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1028,7 +1040,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1038,7 +1061,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5245,17 +5270,23 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+ List *tlist, List *partitioned_rels)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
+	/* Async execution is currently not available for partitioned tables */
+ Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
plan->targetlist = tlist;
plan->qual = NIL;
plan->lefttree = NULL;
plan->righttree = NULL;
node->partitioned_rels = partitioned_rels;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6578,3 +6609,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+			break;
+		}
+		default:
+			break;
+ }
+ return false;
+}
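[ For illustration only: the FDW-side test consulted above can be as
simple as the following hedged sketch (callback name hypothetical); a
real FDW would apply a more careful check. ]

    static bool
    myIsForeignPathAsyncCapable(ForeignPath *path)
    {
        /* naively declare every foreign path async-capable */
        return true;
    }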
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 56a8bf2..fbcdba6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3571,6 +3571,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 0c1a201..e224158 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4335,7 +4335,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
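[ For illustration only: an FDW opts into asynchronous execution by
filling the four new FdwRoutine fields from its handler function. The
my* callback names are hypothetical, matching the sketches above. ]

    PG_FUNCTION_INFO_V1(my_fdw_handler);

    Datum
    my_fdw_handler(PG_FUNCTION_ARGS)
    {
        FdwRoutine *routine = makeNode(FdwRoutine);

        /* ... the usual scan/modify callbacks are set here ... */

        /* new in this patch: asynchronous-execution hooks */
        routine->IsForeignPathAsyncCapable = myIsForeignPathAsyncCapable;
        routine->ForeignAsyncRequest = myForeignAsyncRequest;
        routine->ForeignAsyncConfigureWait = myForeignAsyncConfigureWait;
        routine->ForeignAsyncNotify = myForeignAsyncNotify;

        PG_RETURN_POINTER(routine);
    }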
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fa99244..735a157 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -476,6 +502,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that are
+	 * ready to return a tuple.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+	int			es_num_pending_async;		/* # of pending async nodes */
+	int			es_max_pending_async;		/* allocated size of the array */
+	int			es_async_callback_pending;	/* # of callbacks to deliver */
+	int			es_num_async_ready;			/* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -939,17 +991,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a2dd26f..15f4de9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -235,6 +235,8 @@ typedef struct Append
/* RT indexes of non-leaf tables in a partition tree */
List *partitioned_rels;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e29397f..8bcfcb2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -811,7 +811,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
Attachment: 0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 741b974f971f7f94fdc9cc7bf76db9c73767b7d6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the new infrastructure.
Additionally, give each postgres_fdw connection a connection-specific
storage area so that foreign scans on the same connection can share
some data; postgres_fdw uses it to track the scan node currently
running a query on the underlying connection. This allows asynchronous
execution of multiple foreign scans on one foreign server.
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 12 +-
5 files changed, 583 insertions(+), 152 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating it
+ * with initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
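[ For illustration only: how a scan might use the new connection-specific
storage at BeginForeignScan time. The fsstate variable and its PgFdwState
member mirror the structs added later in this patch. ]

    /* every scan on the same user mapping sees the same private area,
     * so they can tell which of them currently owns the connection */
    fsstate->s.conn = GetConnection(user, false);
    fsstate->s.connpriv = (PgFdwConnpriv *)
        GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));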
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1a9e6c8..88f0c7e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6534,34 +6534,39 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6571,34 +6576,39 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6627,11 +6637,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Hash Cond: (bar2.f1 = foo.f1)
@@ -6644,11 +6654,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(37 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6679,16 +6689,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6706,16 +6716,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6866,27 +6876,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2851869..2347fba 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -122,10 +128,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -136,7 +159,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready; /* true if a tuple is ready to return */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -152,6 +175,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -165,11 +195,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -192,6 +222,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -290,6 +321,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -350,6 +382,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -370,7 +410,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -435,6 +478,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -469,6 +513,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1327,12 +1377,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1388,32 +1447,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and this node wants to
+ * run later. Add myself to the tail of the waiters' list, then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner maintains a shortcut to the last
+ * waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node in the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node in the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1429,7 +1586,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1437,6 +1594,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1465,9 +1625,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1485,7 +1645,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1493,16 +1653,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1704,7 +1880,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1783,6 +1961,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1793,14 +1973,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1808,10 +1988,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1849,6 +2029,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1869,14 +2051,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1884,10 +2066,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1925,6 +2107,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1945,14 +2129,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1960,10 +2144,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2010,16 +2194,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2299,7 +2483,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2352,7 +2538,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2399,8 +2588,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2519,6 +2708,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2561,6 +2751,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2915,11 +3115,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2985,47 +3185,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by a previously-sent FETCH on this node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3035,27 +3284,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3139,7 +3443,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3149,12 +3453,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3162,9 +3466,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3295,9 +3599,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3305,10 +3609,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4502,6 +4806,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is
+ * immediately available. ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan
+ * to acquire a tuple in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4859,7 +5237,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 57dbb79..1194d29 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -117,6 +118,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index cf70ca2..d161a8e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1534,12 +1534,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1598,8 +1598,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
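To make the connection hand-off above easier to follow, here is a
self-contained toy model of the discipline postgresIterateForeignScan
implements: one query in flight per connection, later requestors queue
behind the owner's last_waiter, and ownership passes in FIFO order as
results are absorbed. The field names mirror the patch; everything else
(types, functions, output) is illustrative only, not patch code.

#include <stdio.h>

typedef struct ScanNode
{
    const char      *name;
    struct ScanNode *waiter;        /* next node waiting for this connection */
    struct ScanNode *last_waiter;   /* tail shortcut, kept by the owner */
} ScanNode;

typedef struct Connection
{
    ScanNode   *current_owner;      /* node with a query in flight, or NULL */
} Connection;

/* Ask for data: become the owner if the connection is free, else queue. */
static int
request(Connection *conn, ScanNode *node)
{
    if (conn->current_owner == NULL)
    {
        conn->current_owner = node;
        node->last_waiter = node;   /* nobody is waiting yet */
        return 1;                   /* query sent */
    }

    /* Append behind the tail; only the owner tracks the tail shortcut. */
    conn->current_owner->last_waiter->waiter = node;
    conn->current_owner->last_waiter = node;
    return 0;                       /* queued; caller reports "not ready" */
}

/* The owner's result arrived: absorb it and hand off to the first waiter. */
static void
result_arrived(Connection *conn)
{
    ScanNode   *owner = conn->current_owner;
    ScanNode   *next = owner->waiter;

    printf("%s absorbed its result\n", owner->name);
    owner->waiter = NULL;
    conn->current_owner = next;
    if (next != NULL)
    {
        /* The new owner inherits the tail shortcut. */
        next->last_waiter =
            (owner->last_waiter == owner) ? next : owner->last_waiter;
        printf("%s now owns the connection\n", next->name);
    }
}

int
main(void)
{
    Connection  conn = {NULL};
    ScanNode    a = {"scan_a"}, b = {"scan_b"}, c = {"scan_c"};

    request(&conn, &a);     /* a sends its FETCH */
    request(&conn, &b);     /* b queues behind a */
    request(&conn, &c);     /* c queues behind b */
    result_arrived(&conn);  /* a finishes; b becomes owner */
    result_arrived(&conn);  /* b finishes; c becomes owner */
    result_arrived(&conn);  /* c finishes; connection is idle again */
    return 0;
}

Running it hands the connection over in the order scan_a, scan_b,
scan_c; the last_waiter shortcut exists so that queuing behind the tail
is O(1) instead of walking the whole waiter list.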
0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From f7a0e01e079af33059aa366af18105727b9a0ce0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to get slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to prevent
such a penalty on the existing synchronous route. The asynchronous path
already carries a lot of additional code, so this doesn't add
significant degradation there.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index d91e621..2bdcee6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
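For anyone unfamiliar with the hint being applied here: unlikely() in
c.h boils down to GCC's __builtin_expect, roughly as sketched below
(the exact compiler guards in c.h may differ):

/* Branch-prediction hints, approximately as c.h defines them for
 * GCC-like compilers; elsewhere they degrade to plain boolean tests. */
#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif

This makes the compiler lay out the synchronous path as the
straight-line fall-through and move the async branches out of line,
which is where the misprediction penalty was being paid.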
0005-Fix-a-typo-of-mcxt.c.patch (text/x-patch; charset=us-ascii)
From 26d427bd57c4b5019097a3c1586c14fd7786c7a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:14:15 +0900
Subject: [PATCH 5/5] Fix a typo of mcxt.c
---
src/backend/utils/mmgr/mcxt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 6668bf1..d1598c5 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -208,7 +208,7 @@ MemoryContextDelete(MemoryContext context)
MemoryContextDeleteChildren(context);
/*
- * It's not entirely clear whether 'tis better to do this before or after
+ * It's not entirely clear whether it's better to do this before or after
* delinking the context; but an error in a callback will likely result in
* leaking the whole context (if it's not a root context) if we do it
* after, so let's do it before.
--
2.9.2
Hello.
At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170404.192539.29699823.horiguchi.kyotaro@lab.ntt.co.jp>
The attached patch is rebased on the current master, with no
substantial changes other than disallowing partitioned tables in
async mode by assertion.
This is just rebased onto the current master (d761fe2).
I'll recheck the details further after this.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner
WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows
the creator of a WaitEventSet to specify one.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event
+ * set, so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event
+ * set, so use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
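As a usage sketch of the new CreateWaitEventSet() parameter (assumes
this patch is applied; CurrentResourceOwner stands in for whichever
owner should reclaim the set if an error escapes):

	WaitEventSet *set;

	/* If elog(ERROR) fires before we reach FreeWaitEventSet below, the
	 * resource owner frees the set during abort cleanup instead of
	 * leaking it (and warns about the leak on commit). */
	set = CreateWaitEventSet(CurrentMemoryContext, CurrentResourceOwner, 2);
	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

	/* ... add socket events and call WaitEventSetWait() as usual ... */

	FreeWaitEventSet(set);		/* also forgets the set in its owner */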
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From 1fd1847c105ddd1ed2d10cd9043081d642e6a57f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:46:48 +0900
Subject: [PATCH 2/4] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely different channel to convey tuples.
You will find the details of the API at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 2 +-
src/backend/executor/README | 45 +++++++++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 +++++++++++++++++++++++++++++---
src/backend/executor/nodeForeignscan.c | 49 +++++++++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 69 +++++++++++--
src/backend/postmaster/pgstat.c | 2 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 ++
src/include/foreign/fdwapi.h | 17 ++++
src/include/nodes/execnodes.h | 65 +++++++++++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
19 files changed, 424 insertions(+), 29 deletions(-)
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
execGrouping.o execIndexing.o execJunk.o \
execMain.o execParallel.o execProcnode.o \
execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing -1 to
+poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5469cde..2b727c0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async subplans need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+
+ /*
+ * XXXX: Always clear registered events. This seems a bit inefficient,
+ * but the events to wait for change almost at random on every
+ * call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached. Go on to sync nodes, if any exist */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
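+ /* If every child is async, the synchronous side is already done. */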
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here at once because it might complete
+ * synchronously.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7811ad5..8cd0821 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(partitioned_rels);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4949d58..2d50b8a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(partitioned_rels);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index e24f5d6..fae9396 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
READ_NODE_FIELD(partitioned_rels);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94beeb8..9c29787 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
Index scanrelid, char *enrname);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child
+
+ * The first child in an inheritance set is the representative in
+ * explaining tlist entries (see set_deparse_planstate). We should keep
+ * the first child in best_path->subpaths at the head of the subplan list
+ * for the reason.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5268,17 +5293,23 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+ List *tlist, List *partitioned_rels)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
+ /* Async execution is currently not supported on partitioned tables */
+ Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
plan->targetlist = tlist;
plan->qual = NIL;
plan->lefttree = NULL;
plan->righttree = NULL;
node->partitioned_rels = partitioned_rels;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6608,3 +6639,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
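+ /* fall through */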
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f453dad..97337bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,8 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
+ break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9234bc2..0ed6d2c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
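+ /*
+ * Async-capable children are moved to the front of appendplans,
+ * so use the recorded referent instead of assuming child 0.
+ */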
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d33392f..b58c66e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -486,6 +512,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests that are ready
+ * to return a tuple.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number of events
+ * the current wait event set (if any) was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of nodes to wait for */
+ int es_max_pending_async; /* max # of pending nodes */
+ int es_async_callback_pending; /* # of nodes with callbacks pending */
+ int es_num_async_ready; /* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -950,17 +1002,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d84372d..8bace1f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
/* RT indexes of non-leaf tables in a partition tree */
List *partitioned_rels;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e029c0..7537ce2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From ad2cb622293b3888e0cc7c590f517b5e1b4e5d74 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:49:41 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the new infrastructure. Additionally,
this gives postgres_fdw connections a connection-specific storage area so
that foreign scans on the same connection can share some data; postgres_fdw
uses it to track the scan node currently running a query on the underlying
connection. This allows asynchronous execution of multiple foreign scans
on one foreign server.
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 128 +++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 591 insertions(+), 160 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. The area is
+ * allocated and zeroed with size initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4d86ab5..c1c0320 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6414,7 +6414,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -6442,7 +6442,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -6470,7 +6470,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -6498,7 +6498,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(39 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..6c8da30 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let
+ * the first waiter become the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and we want this node to
+ * run later. Add this node to the tail of the waiters' list, then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner maintains a shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+ /*
+ * If we still haven't received a result for this node, return an
+ * empty slot to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Clean up async state and absorb any result remaining on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
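+ /* Send the FETCH without waiting; the result is collected later. */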
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
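+ /* This node now owns the connection until its result is drained. */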
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Collect the rows already requested from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 25c950d..6dd136c 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..3370778 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1488,25 +1488,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
0004-Apply-unlikely-to-suggest-synchronous-route-of.patch (text/x-patch; charset=us-ascii)
From e70aca71198c32cd810c0bd728a24aef221b8230 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:50:26 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing synchronous route. Asynchronous execution
already has a lot of additional code, so this doesn't add significant
degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
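For context, likely()/unlikely() are branch-prediction hints. A minimal
sketch of how such macros are typically defined for GCC-style compilers
follows; PostgreSQL's c.h provides the real macros, so this is purely
illustrative:

/*
 * Illustrative sketch of GCC-style branch-prediction hint macros.
 * __builtin_expect() tells the compiler which way a branch usually goes,
 * so it can lay out the hot (here, synchronous) path without a taken jump.
 */
#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif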
At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170522.131214.20936668.horiguchi.kyotaro@lab.ntt.co.jp>
The attached patch is rebased on the current master, but no
substantial changes other than disallowing partitioned tables on
async by assertion.
This is just rebased onto the current master (d761fe2).
I'll recheck further details after this.
Sorry, the patch was missing some files to add.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From b849bbbec1c3b9ba62a30c25ac34557a9e279770 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:46:48 +0900
Subject: [PATCH 2/4] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degradation of non-async execution,
this framework uses a completely different channel to convey tuples.
You will find the details of the API at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 2 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 69 ++++-
src/backend/postmaster/pgstat.c | 2 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 974 insertions(+), 29 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
execGrouping.o execIndexing.o execJunk.o \
execMain.o execParallel.o execProcnode.o \
execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
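To make the three callbacks above concrete, here is a minimal,
hypothetical sketch of a producer node implementing them; MyNodeState,
my_sock and my_try_get_tuple() are illustrative assumptions, not part of
the patch:

/* Sketch of a tuple producer under the proposed API (names hypothetical). */
static void
MyNodeAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	MyNodeState *node = (MyNodeState *) areq->requestee;
	Node	   *result;

	if (my_try_get_tuple(node, &result))
		ExecAsyncRequestDone(estate, areq, result);	/* ready right away */
	else
		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static bool
MyNodeAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
						 bool reinit)
{
	MyNodeState *node = (MyNodeState *) areq->requestee;

	if (!reinit)
		return true;			/* our event is still in the set */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  node->my_sock, NULL, areq);
	return true;
}

static void
MyNodeAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	MyNodeState *node = (MyNodeState *) areq->requestee;
	Node	   *result;

	if (my_try_get_tuple(node, &result))
		ExecAsyncRequestDone(estate, areq, result);	/* data arrived */
	else
		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}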
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+ /* Add to the pending list; the event loop delivers any ready result. */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for events only if any node is async-not-ready. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+ /* Don't block if any tuple available. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ { /* No event fired before the wait returned */
+ instr_time cur_time;
+
+ /* With no timeout, just keep waiting. */
+ if (timeout < 0)
+ continue;
+
+ /*
+ * Recompute the remaining time into the outer cur_timeout; a new
+ * local variable here would shadow it and lose the value.
+ */
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout =
+ timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+ if (cur_timeout > 0)
+ continue;
+
+ /* Timeout expired; let the exit check below see zero. */
+ cur_timeout = 0;
+ }
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+ /* Deliver the acquired tuple to the requester */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; on the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there was
+ * nothing to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+ * The wait event set created here should live beyond the ExecutorState
+ * context, but must be released in case of error.
+ */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+ * We may have no events to wait for. This occurs when all nodes that
+ * are executing asynchronously have tuples immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests can omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5469cde..2b727c0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async plan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+
+ /*
+ * XXX: Always clear registered events. This seems a bit inefficient,
+ * but the set of events to wait for is altered almost at random on
+ * every call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached. Fall through to the sync nodes if any exist */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete
+ * while we are still inside this response callback.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7811ad5..8cd0821 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(partitioned_rels);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4949d58..2d50b8a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(partitioned_rels);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index e24f5d6..fae9396 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
READ_NODE_FIELD(partitioned_rels);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94beeb8..9c29787 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
Index scanrelid, char *enrname);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child
+ *
+ * The first child in an inheritance set is the representative used when
+ * explaining tlist entries (see set_deparse_planstate). We should keep
+ * the first child of best_path->subpaths at the head of the subplan list
+ * for that reason.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5268,17 +5293,23 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+ List *tlist, List *partitioned_rels)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
+ /* Currently async on partitioned tables is not available */
+ Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
plan->targetlist = tlist;
plan->qual = NIL;
plan->lefttree = NULL;
plan->righttree = NULL;
node->partitioned_rels = partitioned_rels;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6608,3 +6639,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f453dad..97337bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,8 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9234bc2..0ed6d2c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d33392f..b58c66e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -486,6 +512,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests that are ready
+ * to return a tuple.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of nodes to wait */
+ int es_max_pending_async; /* max # of pending nodes */
+ int es_async_callback_pending; /* # of nodes to callback */
+ int es_num_async_ready; /* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -950,17 +1002,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d84372d..8bace1f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
/* RT indexes of non-leaf tables in a partition tree */
List *partitioned_rels;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e029c0..7537ce2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
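To make the intended control flow easier to follow, here is a minimal,
self-contained sketch of the state machine a PendingAsyncRequest moves
through. This is toy code, not the executor itself; only the state names
match the patch, and the toy_* helpers merely stand in for
ExecAsyncRequest, ExecAsyncNotify and ExecAsyncRequestDone:

#include <stdio.h>
#include <stdbool.h>

typedef enum AsyncRequestState
{
    ASYNCREQ_IDLE,              /* nothing is requested */
    ASYNCREQ_WAITING,           /* waiting for FD events */
    ASYNCREQ_CALLBACK_PENDING,  /* events fired; must re-drive the node */
    ASYNCREQ_COMPLETE           /* result is available */
} AsyncRequestState;

typedef struct ToyRequest
{
    AsyncRequestState state;
    int         result;         /* stands in for the Node *result */
} ToyRequest;

/* Requestee side: produce a result at once, or ask the loop to wait. */
static void
toy_request(ToyRequest *req, bool ready_now)
{
    if (ready_now)
    {
        req->result = 42;
        req->state = ASYNCREQ_COMPLETE;     /* cf. ExecAsyncRequestDone */
    }
    else
        req->state = ASYNCREQ_WAITING;  /* cf. ExecAsyncSetRequiredEvents */
}

/* Event-loop side: an FD event fired for a waiting request. */
static void
toy_notify(ToyRequest *req)
{
    if (req->state == ASYNCREQ_WAITING)
        req->state = ASYNCREQ_CALLBACK_PENDING;
}

int
main(void)
{
    ToyRequest  req = {ASYNCREQ_IDLE, 0};

    toy_request(&req, false);   /* tuple not ready yet: wait */
    toy_notify(&req);           /* the socket became readable */
    if (req.state == ASYNCREQ_CALLBACK_PENDING)
        toy_request(&req, true);    /* re-drive the node; it completes */
    printf("state=%d result=%d\n", req.state, req.result);
    return 0;
}

In the real patch one such request struct sits in each es_pending_async
slot, and Append drives the transitions through the executor's event loop.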
Attachment: 0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From a902431be043ad0e930f03f77faa716ccb286360 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:49:41 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the new infrastructure. Additionally,
give postgres_fdw connections a connection-specific storage area so that
foreign scans on the same connection can share some data; it records the
scan node currently running a query on the underlying connection. This
allows asynchronous execution of multiple foreign scans on one foreign
server.
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 128 +++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 20 +-
5 files changed, 591 insertions(+), 160 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
* one level of subxact open, etc */
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
SubTransactionId parentSubid,
void *arg);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->xact_depth = 0;
entry->have_prep_stmt = false;
entry->have_error = false;
+ entry->storage = NULL;
}
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. Allocate
+ * initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4d86ab5..c1c0320 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6414,7 +6414,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+-------
a | aaa
@@ -6442,7 +6442,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -6470,7 +6470,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | aaa
@@ -6498,7 +6498,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
(3 rows)
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
tableoid | aa
----------+--------
a | newtoo
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(39 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..6c8da30 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the shortcut
+ * to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no one
+ * is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and we want this node to
+ * run later. Add this node to the tail of the waiters' list, then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner maintains a shortcut to the last waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Dispose of the asynchrony state and clean up leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows already requested from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is
+ * immediately available. ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan
+ * to acquire a tuple in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner.
+ * Otherwise another node on this connection owns it and adds the event.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in the event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 25c950d..6dd136c 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..3370778 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1488,25 +1488,25 @@ INSERT INTO b(aa) VALUES('bbb');
INSERT INTO b(aa) VALUES('bbbb');
INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
SELECT tableoid::regclass, * FROM b;
SELECT tableoid::regclass, * FROM ONLY a;
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
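The waiter-list handling in postgresIterateForeignScan can be hard to see
amid the diff, so here is a standalone sketch of just that queue: a singly
linked waiter list where only the current connection owner keeps the
shortcut to the last waiter, giving O(1) enqueue and hand-over. The field
names mirror the patch, but this is illustrative code, not the patch's
functions:

#include <stdio.h>
#include <stddef.h>

typedef struct Scan
{
    const char  *name;
    struct Scan *waiter;        /* next node waiting for the connection */
    struct Scan *last_waiter;   /* maintained only by the current owner */
} Scan;

typedef struct Conn
{
    Scan       *current_owner;
} Conn;

/* Append a scan to the owner's waiter list in O(1). */
static void
enqueue_waiter(Conn *conn, Scan *node)
{
    Scan       *owner = conn->current_owner;

    owner->last_waiter->waiter = node;
    owner->last_waiter = node;
}

/* Owner finished its fetch: hand the connection to the first waiter. */
static void
hand_over(Conn *conn)
{
    Scan       *owner = conn->current_owner;
    Scan       *next = owner->waiter;

    if (next)
    {
        next->last_waiter = owner->last_waiter; /* inherit the shortcut */
        owner->waiter = NULL;
        owner->last_waiter = owner; /* nobody is waiting for me now */
    }
    conn->current_owner = next;
}

int
main(void)
{
    Scan        a = {"a", NULL, &a};
    Scan        b = {"b", NULL, &b};
    Scan        c = {"c", NULL, &c};
    Conn        conn = {&a};

    enqueue_waiter(&conn, &b);
    enqueue_waiter(&conn, &c);
    hand_over(&conn);
    printf("new owner: %s\n", conn.current_owner->name);   /* prints b */
    return 0;
}

Keeping the shortcut only in the owner is what makes both operations
constant-time without a doubly linked list, which is why the comments in
the patch stress that only the current owner maintains last_waiter.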
Attachment: 0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner
In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resource-owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify its resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
&MyProc->procLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXX: There's no property usable as an identifier of a wait event
+ * set, so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: A wait event set has no property to show as an identifier,
+ * so use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
0004-Apply-unlikely-to-suggest-synchronous-route-of.patch (text/x-patch; charset=us-ascii)
From a02948883a160953ed2fac65c15c266d52f2163d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:50:26 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to keep
that penalty off the existing synchronous route. Asynchronous
execution already adds a lot of code, so this causes no significant
degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
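For readers unfamiliar with the hint, unlikely() is the usual
__builtin_expect idiom (PostgreSQL's own likely()/unlikely() macros
live in c.h). A minimal, self-contained sketch of the pattern, with
dispatch() as a purely illustrative stand-in for ExecAppend:

#if defined(__GNUC__) || defined(__clang__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)		/* fallback: no hint */
#endif

/*
 * The compiler lays the hinted branch out of line, so the common
 * synchronous case stays straight-line code and well predicted.
 */
static int
dispatch(int nasyncplans)
{
	if (unlikely(nasyncplans > 0))
		return 1;				/* cold path: async bookkeeping */
	return 0;					/* hot path: plain synchronous scan */
}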
The patch had become conflicted. This is a new version simply rebased
onto the current master; further amendments will follow later. There
are no substantial changes other than disallowing partitioned tables
for async execution by an assertion.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 32d5c143a679bcccee9ff29fe3807dfd8deae458 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner
In certain cases a WaitEventSet needs to be released via a resource
owner. This change adds a resowner member to WaitEventSet and allows
the creator of a WaitEventSet to specify a resource owner.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 261e9be..c4f336d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4a4a287..f2509c3 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: A wait event set has no property to show as an identifier,
+ * so use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: A wait event set has no property to show as an identifier,
+ * so use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..392c1d6 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
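As a usage sketch (not part of the patch): with the extended API a
caller can tie a WaitEventSet to a resource owner, so that an error
thrown while waiting releases the set through resowner cleanup instead
of leaking it. The sock variable below is an assumed, caller-provided
socket:

	WaitEventSet *set;
	WaitEvent	event;

	/* Owned by the transaction's resource owner, not just palloc'd. */
	set = CreateWaitEventSet(TopTransactionContext,
							 TopTransactionResourceOwner, 2);
	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

	/*
	 * If anything below elog(ERROR)s, ResourceOwnerReleaseInternal()
	 * calls FreeWaitEventSet() for us (and warns about the leak when
	 * the release happens at commit).
	 */
	(void) WaitEventSetWait(set, -1, &event, 1, 0);

	/* Normal path: free explicitly, which also forgets it in the owner. */
	FreeWaitEventSet(set);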
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From 1bb440d25eddcbfeff8d3f032432edca15e43477 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/4] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from
underlying nodes asynchronously. This is a different mechanism from
parallel execution: while parallel execution is analogous to threads,
this framework is analogous to select(2), handling multiple inputs in
a single backend process. To avoid degrading non-async execution, the
framework uses a completely separate channel to convey tuples. The
details of the API are described at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 2 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 69 ++++-
src/backend/postmaster/pgstat.c | 2 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 974 insertions(+), 29 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
execGrouping.o execIndexing.o execJunk.o \
execMain.o execParallel.o execProcnode.o \
execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking. Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+ /* Track the request as pending; the event loop will deliver its result. */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for events only if any node is async-not-ready. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+ /* Don't block if any tuple available. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ { /* No event fired */
+ /* Recalculate the remaining time before retrying the wait. */
+ instr_time cur_time;
+
+ /* Wait forever */
+ if (timeout < 0)
+ continue;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout =
+ timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+ if (cur_timeout > 0)
+ continue;
+ }
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+ /* Deliver the acquired tuple to the requester */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there was
+ * no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+ * The wait event set created here must live beyond the ExecutorState
+ * context, but must still be released in case of error.
+ */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+ * We may have no events to wait for. This occurs when all
+ * asynchronously-executing nodes have tuples immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2c..8f8ad2c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, every async subplan needs a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+
+ /*
+ * XXXX: Always clear the registered events. This seems a bit
+ * inefficient, but the set of events to wait for changes almost
+ * at random on every call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached. Fall through to the sync nodes, if any. */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another request here immediately; the next one is
+ * issued from the top of ExecAppend via as_needrequest.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9cde112..1df8ccb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 67ac814..7e5bb38 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(partitioned_rels);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3a23f0b..030ed8e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(partitioned_rels);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2988e8b..0615d52 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
READ_NODE_FIELD(partitioned_rels);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e589d92..c341805 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
Index scanrelid, char *enrname);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *asyncplans, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child
+ *
+ * The first child in an inheritance set is the representative used when
+ * explaining tlist entries (see set_deparse_planstate), so we keep the
+ * first child of best_path->subpaths at the head of the subplan list.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5269,17 +5294,23 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+ List *tlist, List *partitioned_rels)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
+ /* Currently async on partitioned tables is not available */
+ Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
plan->targetlist = tlist;
plan->qual = NIL;
plan->lefttree = NULL;
plan->righttree = NULL;
node->partitioned_rels = partitioned_rels;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6609,3 +6640,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 65b7b32..25c84bc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
+ break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 18d9e27..c7e69cb 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index ee0b6ad..d8c3e31 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3ff4ecd..e6ba392 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e391f20..57876d1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 54c5cf5..225cb1e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -415,6 +415,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Having events to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -506,6 +532,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests that are ready
+ * to return a tuple.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of nodes to wait for */
+ int es_max_pending_async; /* max # of pending nodes */
+ int es_async_callback_pending; /* # of nodes to call back */
+ int es_num_async_ready; /* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -967,17 +1019,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..5abff26 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
/* RT indexes of non-leaf tables in a partition tree */
List *partitioned_rels;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63..fb6d02a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 0b279ad32ea441580ead8056c855119c3d871aca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the infrastructure. Additionally,
give postgres_fdw connections a connection-specific storage area so
that foreign scans on the same connection can share some data;
postgres_fdw uses it to track the scan node currently running a query
on the underlying connection. This allows asynchronous execution of
multiple foreign scans on one foreign server.
---
contrib/postgres_fdw/connection.c | 79 ++--
contrib/postgres_fdw/expected/postgres_fdw.out | 144 ++++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 12 +-
5 files changed, 595 insertions(+), 164 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8c33dea..0b1af3b 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -53,6 +53,7 @@ typedef struct ConnCacheEntry
bool have_prep_stmt; /* have we prepared any stmts in this xact? */
bool have_error; /* have any subxacts aborted in this xact? */
bool changing_xact_state; /* xact state change in process */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -68,6 +69,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void check_conn_params(const char **keywords, const char **values);
static void configure_remote_session(PGconn *conn);
@@ -85,26 +87,12 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
PGresult **result);
-
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches. For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry. We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -132,11 +120,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -150,11 +135,42 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->have_prep_stmt = false;
entry->have_error = false;
entry->changing_xact_state = false;
+ entry->storage = NULL;
}
/* Reject further use of connections which failed abort cleanup. */
pgfdw_reject_incomplete_xact_state_change(entry);
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches. For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry. We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* We don't check the health of cached connection here, because it would
* require some overhead. Broken connection will be detected when the
@@ -191,6 +207,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user. Allocate it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b112c19..7401304 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6417,12 +6417,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6445,12 +6445,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6473,12 +6473,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6501,12 +6501,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(39 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 7214666..b09a099 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection*/
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready;
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* A waiting node at the end of a waiting
+ * list. Maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If someone is waiting for this node's connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the shortcut
+ * to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no one
+ * is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and we want this node to
+ * run later. Add myself to the tail of the waiters' list, then
+ * return not-ready. To avoid scanning through the waiters' list,
+ * the current owner maintains a shortcut to the last
+ * waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node to the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node to the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the result of the query sent earlier and store the rows.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while(fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * let the current connection owner read the result for the running query
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
+ while(PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f396dae..a67da3d 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..1f69908 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From cfc22e4a0cf8597ef13b82c6e177ce90a2444d78 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to asynchronous execution. Apply unlikely() to them to
avoid that penalty on the existing synchronous route. Asynchronous
execution already adds a lot of code, so this doesn't cause significant
degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The patch had conflicts. This is a new version just rebased onto
the current master. Further amendments will be made later.
Can you please explain this part of make_append()?
/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);
I don't think the output of an Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?
And even if there were, the planner should ensure that the executor does not
trip the assertion above. The attached script shows an example of how
to cause the assertion failure.
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Attachments:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The patch had conflicts. This is a new version just rebased onto
the current master. Further amendments will be made later.
Just one idea that I had while reading the code.
In ExecAsyncEventLoop you iterate over estate->es_pending_async, then move the
completed requests to the end and finally adjust estate->es_num_pending_async so
that the array no longer contains the completed requests. I think the point is
that you can then add new requests to the end of the array.
I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think that's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.
ExecAsyncRequest would use the set to look for space for new requests by
iterating over it and trying to find the first gap (which corresponds to a
completed request).
And finally, an item would be removed from the set at the moment the request's
state is set to ASYNCREQ_COMPLETE.
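To make the idea concrete, here is a minimal sketch (the es_pending_async
array and the myindex/state fields are as described above; growing the array
when no gap is found, and the actual callback calls, are omitted):

	Bitmapset  *incomplete = NULL;	/* positions of incomplete requests */
	int			i;

	/* In ExecAsyncRequest: reuse the first gap left by a completed request */
	for (i = 0; i < estate->es_num_pending_async; i++)
	{
		if (!bms_is_member(i, incomplete))
			break;				/* slot i is free for reuse */
	}
	incomplete = bms_add_member(incomplete, i);
	estate->es_pending_async[i]->state = ASYNCREQ_WAITING;

	/* In the event loop: visit only incomplete requests, no compaction */
	i = -1;
	while ((i = bms_next_member(incomplete, i)) >= 0)
	{
		PendingAsyncRequest *areq = estate->es_pending_async[i];

		/* ... ExecAsyncNotify(estate, areq), etc. ... */
		if (areq->state == ASYNCREQ_COMPLETE)
			incomplete = bms_del_member(incomplete, i);
	}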
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Thank you for looking at this.
At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska <ah@cybertec.at> wrote in <4579.1498638234@localhost>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The patch had conflicts. This is a new version just rebased onto
the current master. Further amendments will be made later.
Can you please explain this part of make_append()?
/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);
I don't think the output of an Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?
It was just a developmental sentinel meant to remind me to consider
declarative partitions later, since I didn't have a clear idea of the
differences (or similarities) between appendrels and partitioned_rels.
It doesn't mean that the condition cannot occur. I'll check it out and
support partitioned_rels soon. Sorry for having left it as it is.
And even if there were, the planner should ensure that the executor does not
trip the assertion above. The attached script shows an example of how
to cause the assertion failure.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
Thank you for looking at this.
At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The patch had conflicts. This is a new version just rebased onto
the current master. Further amendments will be made later.
Can you please explain this part of make_append()?
/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);
I don't think the output of an Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?
It was just a developmental sentinel meant to remind me to consider
declarative partitions later, since I didn't have a clear idea of the
differences (or similarities) between appendrels and partitioned_rels.
It doesn't mean that the condition cannot occur. I'll check it out and
support partitioned_rels soon. Sorry for having left it as it is.
When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root. 'appendplans' is a list of subplans for
scanning the leaf partitions in the tree. Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).
The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.
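For illustration (hypothetical table names and RT indexes): if p (RT index 1)
is partitioned into p1 (RT index 2) and p2 (RT index 3), and p2 is in turn
partitioned into the leaf p21 (RT index 4), then for a scan of p,
make_append() receives roughly:
  partitioned_rels = (1 3)                        -- p and p2: locked, not scanned
  appendplans      = (SeqScan(p1), SeqScan(p21))  -- leaf partitions only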
Thanks,
Amit
Hi, I've returned.
At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <63a5a01c-2967-83e0-8bbf-c981404f529e@lab.ntt.co.jp>
Hi,
On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
Thank you for looking at this.
At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
Can you please explain this part of make_append()?
/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);
I don't think the output of an Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use the asynchronous execution?
When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root. 'appendplans' is a list of subplans for
scanning the leaf partitions in the tree. Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).
The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.
Amit, thank you for the detailed explanation. I understand what
it is and that just ignoring it is enough, and I confirmed that
it actually works as before.
I'll address Antonin's comments tomorrow.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the thought.
This is at the PoC level, so I'd be grateful for this kind of
fundamental comment.
At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska <ah@cybertec.at> wrote in <392.1498674144@localhost>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The patch got conflicted. This is a new version just rebased to
the current master. Further amendments will be made later.
Just one idea that I had while reading the code.
In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async so
that the array no longer contains the complete requests. I think the point is
that then you can add new requests to the end of the array.
I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc, then the compaction of
estate->es_pending_async wouldn't be necessary.
ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a completed
request).
And finally, the item would be removed from the set at the moment the request
state is being set to ASYNCREQ_COMPLETE.
Effectively it is a waiting-queue followed by a
completed-list. The point of the compaction is to keep the order
of waiting or not-yet-completed requests, which is crucial to
avoid a kind of precedence inversion. We cannot keep that order
by using a bitmapset in such a way.
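To illustrate with a hypothetical array state (W* = waiting or not yet
completed, C = complete):
  before compaction:  [ C, W1, C, W2, W3 ]
  after compaction:   [ W1, W2, W3, C, C ]   (es_num_pending_async = 3)
The relative order of W1, W2, W3 survives; refilling the gaps via a
bitmapset would instead put new requests in front of older ones.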
The current code waits for all waiters at once and processes all
fired events at once, and the order in the waiting-queue is
inessential in that case. On the other hand, I suppose that
waiting on several tens to near a hundred remote hosts is in a
realistic target range, and keeping the order could be crucial if
we process only part of the queue at a time.
If we put significance on the deviation of the response times of
the remotes, process-all-at-once is effective. In turn, we should
consider the effectiveness of the lifecycle of the larger wait
event set.
Sorry for the discursive discussion, but in short, I have noticed
that I have a lot to consider on this :p Thanks!
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Just one idea that I had while reading the code.
In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async so
that the array no longer contains the complete requests. I think the point is
that then you can add new requests to the end of the array.
I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc, then the compaction of
estate->es_pending_async wouldn't be necessary.
ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a completed
request).
And finally, the item would be removed from the set at the moment the request
state is being set to ASYNCREQ_COMPLETE.
Effectively it is a waiting-queue followed by a
completed-list. The point of the compaction is to keep the order
of waiting or not-yet-completed requests, which is crucial to
avoid a kind of precedence inversion. We cannot keep that order
by using a bitmapset in such a way.
The current code waits for all waiters at once and processes all
fired events at once, and the order in the waiting-queue is
inessential in that case. On the other hand, I suppose that
waiting on several tens to near a hundred remote hosts is in a
realistic target range, and keeping the order could be crucial if
we process only part of the queue at a time.
If we put significance on the deviation of the response times of
the remotes, process-all-at-once is effective. In turn, we should
consider the effectiveness of the lifecycle of the larger wait
event set.
OK, I missed the fact that the order of es_pending_async entries is
important. I think this is worth a comment.
Actually, the reason I thought of the simplification was that I noticed a small
inefficiency in the way you do the compaction. In particular, I think it's not
always necessary to swap the tail and head entries. Would something like this
make sense?
/* If any node completed, compact the array. */
if (any_node_done)
{
int hidx = 0,
tidx;
/*
* Swap all not-yet-completed items to the start of the array.
* Keep them in the same order.
*/
for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
{
PendingAsyncRequest *tail = estate->es_pending_async[tidx];
Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
if (tail->state == ASYNCREQ_COMPLETE)
continue;
/*
* If the array starts with one or more incomplete requests,
* both head and tail point at the same item, so there's no
* point in swapping.
*/
if (tidx > hidx)
{
PendingAsyncRequest *head = estate->es_pending_async[hidx];
/*
* Once the tail got ahead, it should only leave
* ASYNCREQ_COMPLETE behind. Only those can then be seen
* by head.
*/
Assert(head->state == ASYNCREQ_COMPLETE);
estate->es_pending_async[tidx] = head;
estate->es_pending_async[hidx] = tail;
}
++hidx;
}
estate->es_num_pending_async = hidx;
}
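(In effect this is a stable, in-place partition of the array, analogous to a
stable remove-if: the not-yet-completed entries keep their relative order, and
the whole pass costs at most one swap per element.)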
And besides that, I think it'd be more intuitive if the meaning of "head" and
"tail" were reversed: if the array is iterated from lower to higher positions,
then I'd consider the head to be at the higher position, not the tail.
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Hello,
At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska <ah@cybertec.at> wrote in <6448.1499761731@localhost>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Effectively it is a waiting-queue followed by a
completed-list. The point of the compaction is to keep the order
of waiting or not-yet-completed requests, which is crucial to
avoid a kind of precedence inversion. We cannot keep that order
by using a bitmapset in such a way.
The current code waits for all waiters at once and processes all
fired events at once, and the order in the waiting-queue is
inessential in that case. On the other hand, I suppose that
waiting on several tens to near a hundred remote hosts is in a
realistic target range, and keeping the order could be crucial if
we process only part of the queue at a time.
If we put significance on the deviation of the response times of
the remotes, process-all-at-once is effective. In turn, we should
consider the effectiveness of the lifecycle of the larger wait
event set.
OK, I missed the fact that the order of es_pending_async entries is
important. I think this is worth a comment.
I'll put an upper limit on the number of waiters processed at
once, and then add a comment like that.
Actually, the reason I thought of the simplification was that I noticed a small
inefficiency in the way you do the compaction. In particular, I think it's not
always necessary to swap the tail and head entries. Would something like this
make sense?
I'm not sure, but I suppose it is rare for all of the first
several elements in the array to be not yet COMPLETE. In most
cases the first element gets a response first.
/* If any node completed, compact the array. */
if (any_node_done)
{
...
for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
{
...
if (tail->state == ASYNCREQ_COMPLETE)
continue;
/*
* If the array starts with one or more incomplete requests,
* both head and tail point at the same item, so there's no
* point in swapping.
*/
if (tidx > hidx)
{
This works to avoid the swap for the leading elements as long as
none of them is yet ASYNCREQ_COMPLETE. I think it makes sense as
long as it doesn't harm the loop. The optimization is more
effective when hoisted out of the loop, like this:
| for (tidx = 0; tidx < estate->es_num_pending_async &&
|      estate->es_pending_async[tidx]->state != ASYNCREQ_COMPLETE; ++tidx)
|     ;
| for (; tidx < estate->es_num_pending_async; ++tidx)
...
And besides that, I think it'd be more intuitive if the meaning of "head" and
"tail" was reversed: if the array is iterated from lower to higher positions,
then I'd consider head to be at higher position, not tail.
Yeah, but maybe "head" is still confusing even if reversed,
because it is still not the head of anything. It might be less
confusing to rewrite it in a more verbose-but-straightforward way.
| int npending = 0;
|
| /* Skip over not-yet-completed items at the beginning */
| while (npending < estate->es_num_pending_async &&
|        estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
|     npending++;
|
| /* Scan over the rest for not-yet-completed items */
| for (i = npending + 1; i < estate->es_num_pending_async; ++i)
| {
|     PendingAsyncRequest *tmp;
|     PendingAsyncRequest *curr = estate->es_pending_async[i];
|
|     if (curr->state == ASYNCREQ_COMPLETE)
|         continue;
|
|     /* Swap the not-yet-completed item into the tail of the first chunk */
|     tmp = estate->es_pending_async[npending];
|     estate->es_pending_async[npending] = curr;
|     estate->es_pending_async[i] = tmp;
|     ++npending;
| }
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,
Commit 8bf58c0d9bd33686 badly conflicts with this patch, so I have
rebased it and added a patch to refactor the function that Antonin
pointed out. That patch is to be merged into the 0002 patch.
At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170718.162452.221576658.horiguchi.kyotaro@lab.ntt.co.jp>
I'll put an upper limit on the number of waiters processed at
once, and then add a comment like that.
Actually, the reason I thought of the simplification was that I noticed a small
inefficiency in the way you do the compaction. In particular, I think it's not
always necessary to swap the tail and head entries. Would something like this
make sense?
I'm not sure, but I suppose it is rare for all of the first
several elements in the array to be not yet COMPLETE. In most
cases the first element gets a response first.
...
Yeah, but maybe "head" is still confusing even if reversed,
because it is still not the head of anything. It might be less
confusing to rewrite it in a more verbose-but-straightforward way.
| int npending = 0;
|
| /* Skip over not-yet-completed items at the beginning */
| while (npending < estate->es_num_pending_async &&
|        estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
|     npending++;
|
| /* Scan over the rest for not-yet-completed items */
| for (i = npending + 1; i < estate->es_num_pending_async; ++i)
| {
|     PendingAsyncRequest *tmp;
|     PendingAsyncRequest *curr = estate->es_pending_async[i];
|
|     if (curr->state == ASYNCREQ_COMPLETE)
|         continue;
|
|     /* Swap the not-yet-completed item into the tail of the first chunk */
|     tmp = estate->es_pending_async[npending];
|     estate->es_pending_async[npending] = curr;
|     estate->es_pending_async[i] = tmp;
|     ++npending;
| }
The last patch does something like this (with the apparent bugs
fixed).
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 41ad9a7518c066da619363e6cdf8574fa00ee1e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner
A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows
the creator of a WaitEventSet to specify one.
---
src/backend/libpq/pqcomm.c | 2 +-
src/backend/storage/ipc/latch.c | 18 ++++++-
src/backend/storage/lmgr/condition_variable.c | 2 +-
src/backend/utils/resowner/resowner.c | 68 +++++++++++++++++++++++++++
src/include/storage/latch.h | 4 +-
src/include/utils/resowner_private.h | 8 ++++
6 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 4452ea4..ed71e7c 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
(errmsg("could not set socket to nonblocking mode: %m")));
#endif
- FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+ FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/resowner_private.h"
/*
* Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
int nevents; /* number of registered events */
int nevents_space; /* maximum number of events in this set */
+ ResourceOwner resowner; /* Resource owner */
+
/*
* Array, of nevents_space length, storing the definition of events this
* set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
int ret = 0;
int rc;
WaitEvent event;
- WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+ WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
if (wakeEvents & WL_TIMEOUT)
Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
* WaitEventSetWait().
*/
WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
{
WaitEventSet *set;
char *data;
Size sz = 0;
+ if (res)
+ ResourceOwnerEnlargeWESs(res);
+
/*
* Use MAXALIGN size/alignment to guarantee that later uses of memory are
* aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
#endif
+ /* Register this wait event set if requested */
+ set->resowner = res;
+ if (res)
+ ResourceOwnerRememberWES(set->resowner, set);
+
return set;
}
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
}
#endif
+ if (set->resowner != NULL)
+ ResourceOwnerForgetWES(set->resowner, set);
+
pfree(set);
}
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
/* Create a reusable WaitEventSet. */
if (cv_wait_event_set == NULL)
{
- cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+ cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4a4a287..f2509c3 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
ResourceArray snapshotarr; /* snapshot references */
ResourceArray filearr; /* open temporary files */
ResourceArray dsmarr; /* dynamic shmem segments */
+ ResourceArray wesarr; /* wait event sets */
/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
int nlocks; /* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
static void PrintSnapshotLeakWarning(Snapshot snapshot);
static void PrintFileLeakWarning(File file);
static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
/*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+ ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
return owner;
}
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
PrintDSMLeakWarning(res);
dsm_detach(res);
}
+
+ /* Ditto for wait event sets */
+ while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+ {
+ WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+ if (isCommit)
+ PrintWESLeakWarning(event);
+ FreeWaitEventSet(event);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
Assert(owner->snapshotarr.nitems == 0);
Assert(owner->filearr.nitems == 0);
Assert(owner->dsmarr.nitems == 0);
+ Assert(owner->wesarr.nitems == 0);
Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
ResourceArrayFree(&(owner->snapshotarr));
ResourceArrayFree(&(owner->filearr));
ResourceArrayFree(&(owner->dsmarr));
+ ResourceArrayFree(&(owner->wesarr));
pfree(owner);
}
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
dsm_segment_handle(seg));
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+ ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+ ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+ elog(ERROR, "wait event set %p is not owned by resource owner %s",
+ events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+ /*
+ * XXXX: There's no property to show as an identifier of a wait event set,
+ * use its pointer instead.
+ */
+ elog(WARNING, "wait event set leak: %p still referenced",
+ events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..392c1d6 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
#define LATCH_H
#include <signal.h>
+#include "utils/resowner.h"
/*
* Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
extern void SetLatch(volatile Latch *latch);
extern void ResetLatch(volatile Latch *latch);
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+ ResourceOwner res, int nevents);
extern void FreeWaitEventSet(WaitEventSet *set);
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "utils/catcache.h"
#include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
extern void ResourceOwnerForgetDSM(ResourceOwner owner,
dsm_segment *);
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+ WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+ WaitEventSet *);
+
#endif /* RESOWNER_PRIVATE_H */
--
2.9.2
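For illustration, a minimal usage sketch of the interface this patch adds (a
hypothetical caller; WAIT_EVENT_ASYNC_WAIT comes from the 0002 patch below):
WaitEvent     ev;
WaitEventSet *set;
/*
 * Tie the set's lifetime to the current resource owner: if an error is
 * thrown before FreeWaitEventSet() is reached, the owner frees the set
 * at transaction abort (at commit it would print a leak warning).
 */
set = CreateWaitEventSet(CurrentMemoryContext, CurrentResourceOwner, 1);
AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
/* Wait up to one second for the latch to be set. */
if (WaitEventSetWait(set, 1000L, &ev, 1, WAIT_EVENT_ASYNC_WAIT) > 0)
    ResetLatch(MyLatch);
FreeWaitEventSet(set);          /* also forgets the set in its owner */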
0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From afb9353f48dca75c6ab4d6db7a1378d61059e78c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework
This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely different channel to convey tuples.
You will find the details of the API at the end of
src/backend/executor/README.
---
src/backend/executor/Makefile | 2 +-
src/backend/executor/README | 45 +++
src/backend/executor/execAmi.c | 5 +
src/backend/executor/execAsync.c | 520 ++++++++++++++++++++++++++++++++
src/backend/executor/execProcnode.c | 1 +
src/backend/executor/instrument.c | 2 +-
src/backend/executor/nodeAppend.c | 169 ++++++++++-
src/backend/executor/nodeForeignscan.c | 49 +++
src/backend/nodes/copyfuncs.c | 2 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 2 +
src/backend/optimizer/plan/createplan.c | 66 +++-
src/backend/postmaster/pgstat.c | 2 +
src/backend/utils/adt/ruleutils.c | 6 +-
src/include/executor/execAsync.h | 30 ++
src/include/executor/nodeAppend.h | 3 +
src/include/executor/nodeForeignscan.h | 7 +
src/include/foreign/fdwapi.h | 17 ++
src/include/nodes/execnodes.h | 65 +++-
src/include/nodes/plannodes.h | 2 +
src/include/pgstat.h | 3 +-
21 files changed, 971 insertions(+), 29 deletions(-)
create mode 100644 src/backend/executor/execAsync.c
create mode 100644 src/include/executor/execAsync.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
execGrouping.o execIndexing.o execJunk.o \
execMain.o execParallel.o execProcnode.o \
execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times. Likewise,
SRFs are disallowed in an UPDATE's targetlist. There, they would have the
effect of the same row being updated multiple times, which is not very
useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time. This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation. A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately. This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest. Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking.
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait. This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
{
ListCell *l;
+ /* With async, tuples may be interleaved, so can't back up. */
+ if (((Append *) node)->nasyncplans != 0)
+ return false;
+
foreach(l, ((Append *) node)->appendplans)
{
if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
return false;
}
+
/* need not check tlist because Append doesn't evaluate it */
return true;
}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ * Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE 16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple. request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple. This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+ PlanState *requestee)
+{
+ PendingAsyncRequest *areq = NULL;
+ int nasync = estate->es_num_pending_async;
+
+ if (requestee->instrument)
+ InstrStartNode(requestee->instrument);
+
+ /*
+ * If the number of pending asynchronous nodes exceeds the number of
+ * available slots in the es_pending_async array, expand the array.
+ * We start with 16 slots, and thereafter double the array size each
+ * time we run out of slots.
+ */
+ if (nasync >= estate->es_max_pending_async)
+ {
+ int newmax;
+
+ newmax = estate->es_max_pending_async * 2;
+ if (estate->es_max_pending_async == 0)
+ {
+ newmax = 16;
+ estate->es_pending_async =
+ MemoryContextAllocZero(estate->es_query_cxt,
+ newmax * sizeof(PendingAsyncRequest *));
+ }
+ else
+ {
+ int newentries = newmax - estate->es_max_pending_async;
+
+ estate->es_pending_async =
+ repalloc(estate->es_pending_async,
+ newmax * sizeof(PendingAsyncRequest *));
+ MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+ 0, newentries * sizeof(PendingAsyncRequest *));
+ }
+ estate->es_max_pending_async = newmax;
+ }
+
+ /*
+ * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+ * PendingAsyncRequest if there is one. If not, we must allocate a new
+ * one.
+ */
+ if (estate->es_pending_async[nasync] == NULL)
+ {
+ areq = MemoryContextAllocZero(estate->es_query_cxt,
+ sizeof(PendingAsyncRequest));
+ estate->es_pending_async[nasync] = areq;
+ }
+ else
+ {
+ areq = estate->es_pending_async[nasync];
+ MemSet(areq, 0, sizeof(PendingAsyncRequest));
+ }
+ areq->myindex = estate->es_num_pending_async;
+
+ /* Initialize the new request. */
+ areq->state = ASYNCREQ_IDLE;
+ areq->requestor = requestor;
+ areq->request_index = request_index;
+ areq->requestee = requestee;
+
+ /* Give the requestee a chance to do whatever it wants. */
+ switch (nodeTag(requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanRequest(estate, areq);
+ break;
+ default:
+ /* If requestee doesn't support async, caller messed up. */
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(requestee));
+ }
+
+ if (areq->requestee->instrument)
+ InstrStopNode(requestee->instrument, 0);
+
+ /* No result available now, make this node pending */
+ estate->es_num_pending_async++;
+
+ return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor. If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking. If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor. A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+ instr_time start_time;
+ long cur_timeout = timeout;
+ bool requestor_done = false;
+
+ Assert(requestor != NULL);
+
+ /*
+ * If we plan to wait - but not indefinitely - we need to record the
+ * current time.
+ */
+ if (timeout > 0)
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Main event loop: poll for events, deliver notifications. */
+ Assert(estate->es_async_callback_pending == 0);
+ for (;;)
+ {
+ int i;
+ bool any_node_done = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for events only if any node is async-not-ready. */
+ if (estate->es_num_async_ready < estate->es_num_pending_async)
+ {
+ /* Don't block if any tuple available. */
+ if (estate->es_async_callback_pending > 0)
+ ExecAsyncEventWait(estate, 0);
+ else if (!ExecAsyncEventWait(estate, cur_timeout))
+ { /* Not fired */
+ /* Exited before timeout. Calculate the remaining time. */
+ instr_time cur_time;
+ long cur_timeout = -1;
+
+ /* Wait forever */
+ if (timeout < 0)
+ continue;
+
+ INSTR_TIME_SET_CURRENT(cur_time);
+ INSTR_TIME_SUBTRACT(cur_time, start_time);
+ cur_timeout =
+ timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+ if (cur_timeout > 0)
+ continue;
+ }
+ }
+
+ /* Deliver notifications. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->requestee->instrument)
+ InstrStartNode(areq->requestee->instrument);
+
+ /* Notify if the requestee is ready */
+ if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+ ExecAsyncNotify(estate, areq);
+
+ /* Deliver the acquired tuple to the requester */
+ if (areq->state == ASYNCREQ_COMPLETE)
+ {
+ any_node_done = true;
+ if (requestor == areq->requestor)
+ requestor_done = true;
+ ExecAsyncResponse(estate, areq);
+
+ if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument,
+ TupIsNull((TupleTableSlot*)areq->result) ?
+ 0.0 : 1.0);
+ }
+ else if (areq->requestee->instrument)
+ InstrStopNode(areq->requestee->instrument, 0);
+ }
+
+ /* If any node completed, compact the array. */
+ if (any_node_done)
+ {
+ int hidx = 0,
+ tidx;
+
+ /*
+ * Swap all not-yet-completed items to the start of the array.
+ * Keep them in the same order.
+ */
+ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ {
+ PendingAsyncRequest *head;
+ PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+ Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+ if (tail->state == ASYNCREQ_COMPLETE)
+ continue;
+ head = estate->es_pending_async[hidx];
+ estate->es_pending_async[tidx] = head;
+ estate->es_pending_async[hidx] = tail;
+ ++hidx;
+ }
+ estate->es_num_pending_async = hidx;
+ }
+
+ /*
+ * We only consider exiting the loop when no notifications are
+ * pending. Otherwise, each call to this function might advance
+ * the computation by only a very small amount; to the contrary,
+ * we want to push it forward as far as possible.
+ */
+ if (estate->es_async_callback_pending == 0)
+ {
+ /* If requestor is ready, exit. */
+ if (requestor_done)
+ return true;
+ /* If timeout was 0 or has expired, exit. */
+ if (cur_timeout == 0)
+ return false;
+ }
+ }
+}
+
+/*
+ * Wait or poll for events. As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there was no
+ * event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+ WaitEvent occurred_event[EVENT_BUFFER_SIZE];
+ int noccurred;
+ int i;
+ int n;
+ bool reinit = false;
+ bool process_latch_set = false;
+ bool added = false;
+ bool fired = false;
+
+ if (estate->es_wait_event_set == NULL)
+ {
+ /*
+ * Allow for a few extra events without reinitializing. It
+ * doesn't seem worth the complexity of doing anything very
+ * aggressive here, because plans that depend on massive numbers
+ * of external FDs are likely to run afoul of kernel limits anyway.
+ */
+ estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+ /*
+ * The wait event set created here should be live beyond ExecutorState
+ * context but released in case of error.
+ */
+ estate->es_wait_event_set =
+ CreateWaitEventSet(TopTransactionContext,
+ TopTransactionResourceOwner,
+ estate->es_allocated_fd_events + 1);
+
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+ reinit = true;
+ }
+
+ /* Give each waiting node a chance to add or modify events. */
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ added |= ExecAsyncConfigureWait(estate, areq, reinit);
+ }
+
+ /*
+ * We may have no events to wait for. This occurs when all nodes that
+ * are executing asynchronously have tuples immediately available.
+ */
+ if (!added)
+ return true;
+
+ /* Wait for at least one event to occur. */
+ noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+ occurred_event, EVENT_BUFFER_SIZE,
+ WAIT_EVENT_ASYNC_WAIT);
+
+ if (noccurred == 0)
+ return false;
+
+ /*
+ * Loop over the occurred events and set the callback_pending flags
+ * for the appropriate requests. The waiting nodes should have
+ * registered their wait events with user_data pointing back to the
+ * PendingAsyncRequest, but the process latch needs special handling.
+ */
+ for (n = 0; n < noccurred; ++n)
+ {
+ WaitEvent *w = &occurred_event[n];
+
+ if ((w->events & WL_LATCH_SET) != 0)
+ {
+ process_latch_set = true;
+ continue;
+ }
+
+ if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+ {
+ PendingAsyncRequest *areq = w->user_data;
+
+ Assert(areq->state == ASYNCREQ_WAITING);
+
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+
+ /*
+ * If the process latch got set, we must schedule a callback for every
+ * requestee that cares about it.
+ */
+ if (process_latch_set)
+ {
+ for (i = 0; i < estate->es_num_pending_async; ++i)
+ {
+ PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+ if (areq->wants_process_latch)
+ {
+ Assert(areq->state == ASYNCREQ_WAITING);
+ areq->state = ASYNCREQ_CALLBACK_PENDING;
+ estate->es_async_callback_pending++;
+ fired = true;
+ }
+ }
+ }
+
+ return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait. We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests can omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestee))
+ {
+ case T_ForeignScanState:
+ ExecAsyncForeignScanNotify(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestee));
+ }
+
+ estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ switch (nodeTag(areq->requestor))
+ {
+ case T_AppendState:
+ ExecAsyncAppendResponse(estate, areq);
+ break;
+ default:
+ elog(ERROR, "unrecognized node type: %d",
+ (int) nodeTag(areq->requestor));
+ }
+ estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch. num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+ int num_fd_events, bool wants_process_latch,
+ bool force_reset)
+{
+ estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+ areq->num_fd_events = num_fd_events;
+ areq->wants_process_latch = wants_process_latch;
+ areq->state = ASYNCREQ_WAITING;
+
+ if (force_reset && estate->es_wait_event_set != NULL)
+ ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it. The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+ /*
+ * Since the request is complete, the requestee is no longer allowed
+ * to wait for any events. Note that this forces a rebuild of
+ * es_wait_event_set every time a process that was previously waiting
+ * stops doing so. It might be possible to defer that decision until
+ * we actually wait again, because it's quite possible that a new
+ * request will be made of the same node before any wait actually
+ * happens. However, we have to balance the cost of rebuilding the
+ * WaitEventSet against the additional overhead of tracking which nodes
+ * need a callback to remove registered wait events. It's not clear
+ * that we would come out ahead, so use brute force for now.
+ */
+ Assert(areq->state == ASYNCREQ_IDLE ||
+ areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+ if (areq->num_fd_events > 0 || areq->wants_process_latch)
+ ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+ /* Save result and mark request as complete. */
+ areq->result = result;
+ areq->state = ASYNCREQ_COMPLETE;
+ estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+ if (estate->es_wait_event_set == NULL)
+ return;
+
+ FreeWaitEventSet(estate->es_wait_event_set);
+ estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2c..8f8ad2c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeValuesscan.h"
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "miscadmin.h"
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
- if (!instr->running)
+ if (!instr->running && nTuples > 0)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
#include "postgres.h"
#include "executor/execdebug.h"
+#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* get information from the append node
*/
- whichplan = appendstate->as_whichplan;
+ whichplan = appendstate->as_whichsyncplan;
- if (whichplan < 0)
+ /*
+ * This routine is only responsible for setting up for nodes being scanned
+ * synchronously, so the first node we can scan is given by nasyncplans
+ * and the last is given by as_nplans - 1.
+ */
+ if (whichplan < appendstate->as_nasyncplans)
{
/*
* if scanning in reverse, we start at the last scan in the list and
* then proceed back to the first.. in any case we inform ExecAppend
* that we are at the end of the line by returning FALSE
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
return FALSE;
}
else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
/*
* as above, end the scan if we go beyond the last scan in our list..
*/
- appendstate->as_whichplan = appendstate->as_nplans - 1;
+ appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
return FALSE;
}
else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.state = estate;
appendstate->appendplans = appendplanstates;
appendstate->as_nplans = nplans;
+ appendstate->as_nasyncplans = node->nasyncplans;
+ appendstate->as_syncdone = (node->nasyncplans == nplans);
+ appendstate->as_asyncresult = (TupleTableSlot **)
+ palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+ /* initially, all async subplans need a request */
+ for (i = 0; i < appendstate->as_nasyncplans; ++i)
+ appendstate->as_needrequest =
+ bms_add_member(appendstate->as_needrequest, i);
/*
* Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
appendstate->ps.ps_ProjInfo = NULL;
/*
- * initialize to scan first subplan
+ * initialize to scan first synchronous subplan
*/
- appendstate->as_whichplan = 0;
+ appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
exec_append_initialize_next(appendstate);
return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
+ if (node->as_nasyncplans > 0)
+ {
+ EState *estate = node->ps.state;
+ int i;
+
+ /*
+ * If there are any asynchronously-generated results that have
+ * not yet been returned, return one of them.
+ */
+ if (node->as_nasyncresult > 0)
+ {
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+
+ /*
+ * XXXX: Always clear the registered events. This seems a bit
+ * inefficient, but the set of events to wait for changes almost
+ * randomly on every call.
+ */
+ ExecAsyncClearEvents(estate);
+
+ while ((i = bms_first_member(node->as_needrequest)) >= 0)
+ {
+ node->as_nasyncpending++;
+ ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+ }
+
+ if (node->as_nasyncpending == 0 && node->as_syncdone)
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+
for (;;)
{
PlanState *subnode;
TupleTableSlot *result;
/*
- * figure out which subplan we are currently processing
+ * if we have async requests outstanding, run the event loop
+ */
+ if (node->as_nasyncpending > 0)
+ {
+ long timeout = node->as_syncdone ? -1 : 0;
+
+ while (node->as_nasyncpending > 0)
+ {
+ if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+ node->as_nasyncresult > 0)
+ {
+ /* Asynchronous subplan returned a tuple! */
+ --node->as_nasyncresult;
+ return node->as_asyncresult[node->as_nasyncresult];
+ }
+
+ /* Timeout reached. Fall through to the sync nodes if any exist */
+ if (!node->as_syncdone)
+ break;
+ }
+
+ /*
+ * If there is no asynchronous activity still pending and the
+ * synchronous activity is also complete, we're totally done
+ * scanning this node. Otherwise, we're done with the
+ * asynchronous stuff but must continue scanning the synchronous
+ * children.
+ */
+ if (node->as_syncdone)
+ {
+ Assert(node->as_nasyncpending == 0);
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
+
+ /*
+ * figure out which synchronous subplan we are currently processing
*/
- subnode = node->appendplans[node->as_whichplan];
+ Assert(!node->as_syncdone);
+ subnode = node->appendplans[node->as_whichsyncplan];
/*
* get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
/*
* Go on to the "next" subplan in the appropriate direction. If no
* more subplans, return the empty slot set up for us by
- * ExecInitAppend.
+ * ExecInitAppend, unless there are async plans we have yet to finish.
*/
if (ScanDirectionIsForward(node->ps.state->es_direction))
- node->as_whichplan++;
+ node->as_whichsyncplan++;
else
- node->as_whichplan--;
+ node->as_whichsyncplan--;
if (!exec_append_initialize_next(node))
- return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ {
+ node->as_syncdone = true;
+ if (node->as_nasyncpending == 0)
+ {
+ Assert(bms_is_empty(node->as_needrequest));
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }
+ }
/* Else loop back and try to get a tuple from the new subplan */
}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
{
int i;
+ /*
+ * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+ */
+
+ /* Reset async state. */
+ for (i = 0; i < node->as_nasyncplans; ++i)
+ node->as_needrequest = bms_add_member(node->as_needrequest, i);
+ node->as_nasyncresult = 0;
+ node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
for (i = 0; i < node->as_nplans; i++)
{
PlanState *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
if (subnode->chgParam == NULL)
ExecReScan(subnode);
}
- node->as_whichplan = 0;
+ node->as_whichsyncplan = node->as_nasyncplans;
exec_append_initialize_next(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncAppendResponse
+ *
+ * Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+ AppendState *node = (AppendState *) areq->requestor;
+ TupleTableSlot *slot;
+
+ /* We shouldn't be called until the request is complete. */
+ Assert(areq->state == ASYNCREQ_COMPLETE);
+
+ /* Our result slot shouldn't already be occupied. */
+ Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+ /* Result should be a TupleTableSlot or NULL. */
+ slot = (TupleTableSlot *) areq->result;
+ Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+ /* This is no longer pending */
+ --node->as_nasyncpending;
+
+ /* If the result is NULL or an empty slot, there's nothing more to do. */
+ if (TupIsNull(slot))
+ return;
+
+ /* Save result so we can return it. */
+ Assert(node->as_nasyncresult < node->as_nasyncplans);
+ node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+ /*
+ * Mark the node that returned a result as ready for a new request. We
+ * don't launch another one here immediately because it might complete.
+ */
+ node->as_needrequest =
+ bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9cde112..1df8ccb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
if (fdwroutine->ShutdownForeignScan)
fdwroutine->ShutdownForeignScan(node);
}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanRequest
+ *
+ * Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncRequest != NULL);
+ fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanConfigureWait
+ *
+ * In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+ return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ * ExecAsyncForeignScanNotify
+ *
+ * Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ FdwRoutine *fdwroutine = node->fdwroutine;
+
+ Assert(fdwroutine->ForeignAsyncNotify != NULL);
+ fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 45a04b0..929dfea 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
*/
COPY_NODE_FIELD(partitioned_rels);
COPY_NODE_FIELD(appendplans);
+ COPY_SCALAR_FIELD(nasyncplans);
+ COPY_SCALAR_FIELD(referent);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 379d92a..823725b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -394,6 +394,8 @@ _outAppend(StringInfo str, const Append *node)
WRITE_NODE_FIELD(partitioned_rels);
WRITE_NODE_FIELD(appendplans);
+ WRITE_INT_FIELD(nasyncplans);
+ WRITE_INT_FIELD(referent);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 86c811d..5568288 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1594,6 +1594,8 @@ _readAppend(void)
READ_NODE_FIELD(partitioned_rels);
READ_NODE_FIELD(appendplans);
+ READ_INT_FIELD(nasyncplans);
+ READ_INT_FIELD(referent);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..a339575 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
Index scanrelid, char *enrname);
static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+ int referent, List *tlist, List *partitioned_rels);
static RecursiveUnion *make_recursive_union(List *tlist,
Plan *lefttree,
Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
List *rowMarks, OnConflictExpr *onconflict, int epqParam);
static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
/*
* create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
{
Append *plan;
List *tlist = build_path_tlist(root, &best_path->path);
- List *subplans = NIL;
+ List *asyncplans = NIL;
+ List *syncplans = NIL;
ListCell *subpaths;
+ int nasyncplans = 0;
+ bool first = true;
+ bool referent_is_sync = true;
/*
* The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
return plan;
}
- /* Build the plan for each child */
+ /*
+ * Build the plan for each child.
+ *
+ * The first child in an inheritance set is the representative used for
+ * EXPLAINing tlist entries (see set_deparse_planstate). Since async
+ * children are moved to the head of the subplan list, we track where
+ * that first child ends up and record its position as the referent.
+ */
foreach(subpaths, best_path->subpaths)
{
Path *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
/* Must insist that all children return the same tlist */
subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
- subplans = lappend(subplans, subplan);
+ /* Classify as async-capable or not */
+ if (is_async_capable_path(subpath))
+ {
+ asyncplans = lappend(asyncplans, subplan);
+ ++nasyncplans;
+ if (first)
+ referent_is_sync = false;
+ }
+ else
+ syncplans = lappend(syncplans, subplan);
+
+ first = false;
}
/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
* parent-rel Vars it'll be asked to emit.
*/
- plan = make_append(subplans, tlist, best_path->partitioned_rels);
+ plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+ referent_is_sync ? nasyncplans : 0, tlist,
+ best_path->partitioned_rels);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -5270,7 +5295,8 @@ make_foreignscan(List *qptlist,
}
static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+ List *tlist, List *partitioned_rels)
{
Append *node = makeNode(Append);
Plan *plan = &node->plan;
@@ -5281,6 +5307,8 @@ make_append(List *appendplans, List *tlist, List *partitioned_rels)
plan->righttree = NULL;
node->partitioned_rels = partitioned_rels;
node->appendplans = appendplans;
+ node->nasyncplans = nasyncplans;
+ node->referent = referent;
return node;
}
@@ -6613,3 +6641,27 @@ is_projection_capable_plan(Plan *plan)
}
return true;
}
+
+/*
+ * is_async_capable_path
+ * Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_ForeignPath:
+ {
+ FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+ Assert(fdwroutine != NULL);
+ if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+ fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+ return true;
+ }
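+ /* fall through */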
+ default:
+ break;
+ }
+ return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a0b0eec..af288be 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
event_name = "LogicalSyncStateChange";
+ break;
+ case WAIT_EVENT_ASYNC_WAIT:
+ event_name = "AsyncExecWait";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index d83377d..1c80e85 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4432,7 +4432,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
* lists containing references to non-target relations.
*/
if (IsA(ps, AppendState))
- dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+ {
+ int idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+ dpns->outer_planstate =
+ ((AppendState *) ps)->appendplans[idx];
+ }
else if (IsA(ps, MergeAppendState))
dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ * Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+ int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+ long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+ PendingAsyncRequest *areq, int num_fd_events,
+ bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+ PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index ee0b6ad..d8c3e31 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
extern void ExecEndAppend(AppendState *node);
extern void ExecReScanAppend(AppendState *node);
+extern void ExecAsyncAppendResponse(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3ff4ecd..e6ba392 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
shm_toc *toc);
extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern void ExecAsyncForeignScanRequest(EState *estate,
+ PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+ PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+ PendingAsyncRequest *areq);
+
#endif /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e391f20..57876d1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
RelOptInfo *rel,
RangeTblEntry *rte);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+ PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
/*
* FdwRoutine is the struct returned by a foreign-data wrapper's handler
* function. It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
EstimateDSMForeignScan_function EstimateDSMForeignScan;
InitializeDSMForeignScan_function InitializeDSMForeignScan;
InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+ /* Support functions for asynchronous execution */
+ IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+ ForeignAsyncRequest_function ForeignAsyncRequest;
+ ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+ ForeignAsyncNotify_function ForeignAsyncNotify;
+
ShutdownForeignScan_function ShutdownForeignScan;
} FdwRoutine;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 85fac8a..48c7c2f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -415,6 +415,32 @@ typedef struct ResultRelInfo
} ResultRelInfo;
/* ----------------
+ * PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+ ASYNCREQ_IDLE, /* Nothing is requested */
+ ASYNCREQ_WAITING, /* Waiting for events */
+ ASYNCREQ_CALLBACK_PENDING, /* Has events waiting to be processed */
+ ASYNCREQ_COMPLETE /* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+ int myindex; /* Index in es_pending_async. */
+ struct PlanState *requestor; /* Node that wants a tuple. */
+ struct PlanState *requestee; /* Node from which a tuple is wanted. */
+ int request_index; /* Scratch space for requestor. */
+ int num_fd_events; /* Max number of FD events requestee needs. */
+ bool wants_process_latch; /* Requestee cares about MyLatch. */
+ AsyncRequestState state;
+ Node *result; /* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
* EState information
*
* Master working state for an Executor invocation
@@ -506,6 +532,32 @@ typedef struct EState
/* The per-query shared memory area to use for parallel execution. */
struct dsa_area *es_query_dsa;
+
+ /*
+ * Support for asynchronous execution.
+ *
+ * es_max_pending_async is the allocated size of es_pending_async, and
+ * es_num_pending_async is the number of entries that are currently valid.
+ * (Entries after that may point to storage that can be reused.)
+ * es_num_async_ready is the number of PendingAsyncRequests that are ready
+ * to return a tuple.
+ *
+ * es_total_fd_events is the total number of FD events needed by all
+ * pending async nodes, and es_allocated_fd_events is the number any
+ * current wait event set was allocated to handle. es_wait_event_set, if
+ * non-NULL, is a previously allocated event set that may be reusable by a
+ * future wait provided that nothing's been removed and not too many more
+ * events have been added.
+ */
+ int es_num_pending_async; /* # of nodes to wait for */
+ int es_max_pending_async; /* max # of pending nodes */
+ int es_async_callback_pending; /* # of nodes with callbacks pending */
+ int es_num_async_ready; /* # of tuple-ready nodes */
+ PendingAsyncRequest **es_pending_async;
+
+ int es_total_fd_events;
+ int es_allocated_fd_events;
+ struct WaitEventSet *es_wait_event_set;
} EState;
@@ -971,17 +1023,20 @@ typedef struct ModifyTableState
/* ----------------
* AppendState information
- *
- * nplans how many plans are in the array
- * whichplan which plan is being executed (0 .. n-1)
* ----------------
*/
typedef struct AppendState
{
PlanState ps; /* its first field is NodeTag */
PlanState **appendplans; /* array of PlanStates for my inputs */
- int as_nplans;
- int as_whichplan;
+ int as_nplans; /* total # of children */
+ int as_nasyncplans; /* # of async-capable children */
+ int as_whichsyncplan; /* which sync plan is being executed */
+ bool as_syncdone; /* all synchronous plans done? */
+ Bitmapset *as_needrequest; /* async plans needing a new request */
+ TupleTableSlot **as_asyncresult; /* unreturned results of async plans */
+ int as_nasyncresult; /* # of valid entries in as_asyncresult */
+ int as_nasyncpending; /* # of outstanding async requests */
} AppendState;
/* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..5abff26 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
/* RT indexes of non-leaf tables in a partition tree */
List *partitioned_rels;
List *appendplans;
+ int nasyncplans; /* # of async plans, always at start of list */
+ int referent; /* index of inheritance tree referent */
} Append;
/* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63..fb6d02a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_LOGICAL_SYNC_DATA,
- WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+ WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+ WAIT_EVENT_ASYNC_WAIT
} WaitEventIPC;
/* ----------
--
2.9.2
Attachment: 0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 117e3f2e0f17985af510bce9ab28a9c50f9e0b72 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.
Make postgres_fdw async-capable using the infrastructure. Additionally,
this gives connections for postgres_fdw a connection-specific storage
area so that foreign scans on the same connection can share some data:
postgres_fdw records the scan node currently running a query on the
underlying connection. This allows asynchronous execution of multiple
foreign scans on one foreign server.
---
contrib/postgres_fdw/connection.c | 64 ++-
contrib/postgres_fdw/expected/postgres_fdw.out | 144 ++++---
contrib/postgres_fdw/postgres_fdw.c | 522 +++++++++++++++++++++----
contrib/postgres_fdw/postgres_fdw.h | 2 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 12 +-
5 files changed, 588 insertions(+), 156 deletions(-)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..6247dc8 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
bool invalidated; /* true if reconnect is pending */
uint32 server_hashvalue; /* hash value of foreign server OID */
uint32 mapping_hashvalue; /* hash value of user mapping OID */
+ void *storage; /* connection specific storage */
} ConnCacheEntry;
/*
@@ -73,6 +74,7 @@ static unsigned int prep_stmt_number = 0;
static bool xact_got_connection = false;
/* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
static void disconnect_pg_server(ConnCacheEntry *entry);
static void check_conn_params(const char **keywords, const char **values);
@@ -94,17 +96,11 @@ static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization. A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements. Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
+ * Common function to acquire or create a connection cache entry.
*/
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
{
bool found;
ConnCacheEntry *entry;
@@ -136,11 +132,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
pgfdw_inval_callback, (Datum) 0);
}
- /* Set flag that we did GetConnection during the current transaction */
- xact_got_connection = true;
-
/* Create hash key for the entry. Assume no pad bytes in key struct */
- key = user->umid;
+ key = umid;
/*
* Find or create cached entry for requested connection.
@@ -158,6 +151,29 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
/* Reject further use of connections which failed abort cleanup. */
pgfdw_reject_incomplete_xact_state_change(entry);
+ return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization. A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements. Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+ ConnCacheEntry *entry;
+
+ /* Set flag that we did GetConnection during the current transaction */
+ xact_got_connection = true;
+
+ entry = get_connection_entry(user->umid);
+
/*
* If the connection needs to be remade due to invalidation, disconnect as
* soon as we're out of all transactions.
@@ -196,6 +212,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
entry->mapping_hashvalue =
GetSysCacheHashValue1(USERMAPPINGOID,
ObjectIdGetDatum(user->umid));
+ entry->storage = NULL;
/* Now try to make the connection */
entry->conn = connect_pg_server(server, user);
@@ -216,6 +233,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
}
/*
+ * Returns the connection-specific storage for this user, allocating it
+ * with initsize bytes of zeroed memory if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+ ConnCacheEntry *entry;
+
+ entry = get_connection_entry(user->umid);
+ if (entry->storage == NULL)
+ {
+ entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+ memset(entry->storage, 0, initsize);
+ }
+
+ return entry->storage;
+}
+
+/*
* Connect to remote server using specified server and user mapping properties.
*/
static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b331..9d7eb9b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6515,12 +6515,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+-------
- a | aaa
- a | aaaa
- a | aaaaa
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | aaaa
+ a | aaaaa
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6543,12 +6543,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | bbb
b | bbbb
b | bbbbb
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6571,12 +6571,12 @@ UPDATE b SET aa = 'new';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | aaa
- a | zzzzzz
- a | zzzzzz
b | new
b | new
b | new
+ a | aaa
+ a | zzzzzz
+ a | zzzzzz
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6599,12 +6599,12 @@ UPDATE a SET aa = 'newtoo';
SELECT tableoid::regclass, * FROM a;
tableoid | aa
----------+--------
- a | newtoo
- a | newtoo
- a | newtoo
b | newtoo
b | newtoo
b | newtoo
+ a | newtoo
+ a | newtoo
+ a | newtoo
(6 rows)
SELECT tableoid::regclass, * FROM b;
@@ -6662,35 +6662,40 @@ insert into bar2 values(3,33,33);
insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
f1 | f2
----+----
1 | 11
@@ -6700,35 +6705,40 @@ select * from bar where f1 in (select f1 from foo) for update;
(4 rows)
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
- QUERY PLAN
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+ QUERY PLAN
+----------------------------------------------------------------------------------------------------------------
LockRows
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
- -> Hash Join
+ -> Merge Join
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
- Hash Cond: (bar.f1 = foo.f1)
- -> Append
- -> Seq Scan on public.bar
+ Merge Cond: (bar.f1 = foo.f1)
+ -> Merge Append
+ Sort Key: bar.f1
+ -> Sort
Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+ Sort Key: bar.f1
+ -> Seq Scan on public.bar
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
-> Foreign Scan on public.bar2
Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
- Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
- -> Hash
+ Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+ -> Sort
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+ Sort Key: foo.f1
-> HashAggregate
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
f1 | f2
----+----
1 | 11
@@ -6758,11 +6768,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Hash Join
Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
Inner Unique: true
@@ -6776,11 +6786,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
Output: foo.ctid, foo.*, foo.tableoid, foo.f1
Group Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-> Foreign Scan on public.foo2
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
(39 rows)
update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6811,16 +6821,16 @@ where bar.f1 = ss.f1;
Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
Hash Cond: (foo.f1 = bar.f1)
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Hash
Output: bar.f1, bar.f2, bar.ctid
-> Seq Scan on public.bar
@@ -6838,16 +6848,16 @@ where bar.f1 = ss.f1;
Output: (ROW(foo.f1)), foo.f1
Sort Key: foo.f1
-> Append
- -> Seq Scan on public.foo
- Output: ROW(foo.f1), foo.f1
-> Foreign Scan on public.foo2
Output: ROW(foo2.f1), foo2.f1
Remote SQL: SELECT f1 FROM public.loct1
- -> Seq Scan on public.foo foo_1
- Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-> Foreign Scan on public.foo2 foo2_1
Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
Remote SQL: SELECT f1 FROM public.loct1
+ -> Seq Scan on public.foo
+ Output: ROW(foo.f1), foo.f1
+ -> Seq Scan on public.foo foo_1
+ Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
(45 rows)
update bar set f2 = f2 + 100
@@ -6998,27 +7008,33 @@ delete from foo where f1 < 5 returning *;
(5 rows)
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
- QUERY PLAN
-------------------------------------------------------------------------------
- Update on public.bar
- Output: bar.f1, bar.f2
- Update on public.bar
- Foreign Update on public.bar2
- -> Seq Scan on public.bar
- Output: bar.f1, (bar.f2 + 100), bar.ctid
- -> Foreign Update on public.bar2
- Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+ QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+ Output: u.f1, u.f2
+ Sort Key: u.f1
+ CTE u
+ -> Update on public.bar
+ Output: bar.f1, bar.f2
+ Update on public.bar
+ Foreign Update on public.bar2
+ -> Seq Scan on public.bar
+ Output: bar.f1, (bar.f2 + 100), bar.ctid
+ -> Foreign Update on public.bar2
+ Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+ -> CTE Scan on u
+ Output: u.f1, u.f2
+(14 rows)
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
f1 | f2
----+-----
1 | 311
2 | 322
- 6 | 266
3 | 333
4 | 344
+ 6 | 266
7 | 277
(6 rows)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d77c2a7..01b2398 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
#include "commands/defrem.h"
#include "commands/explain.h"
#include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
#include "foreign/fdwapi.h"
#include "funcapi.h"
#include "miscadmin.h"
@@ -34,6 +36,7 @@
#include "optimizer/var.h"
#include "optimizer/tlist.h"
#include "parser/parsetree.h"
+#include "pgstat.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
/* If no remote estimates, assume a sort costs 20% extra */
#define DEFAULT_FDW_SORT_MULTIPLIER 1.2
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
/*
* Indexes of FDW-private information stored in fdw_private lists.
*
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
};
/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+ ForeignScanState *current_owner; /* The node currently running a query
+ * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+ PGconn *conn; /* connection for the scan */
+ PgFdwConnpriv *connpriv; /* connection private memory */
+} PgFdwState;
+
+/*
* Execution state of a foreign scan using postgres_fdw.
*/
typedef struct PgFdwScanState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table. NULL
* for a foreign join scan. */
TupleDesc tupdesc; /* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
List *retrieved_attrs; /* list of retrieved attribute numbers */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
+ bool result_ready; /* true if a result (tuple or EOF) is available */
unsigned int cursor_number; /* quasi-unique ID for my cursor */
bool cursor_exists; /* have we created the cursor? */
int numParams; /* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
/* batch-level state, for optimizing rewinds and avoiding useless fetch */
int fetch_ct_2; /* Min(# of fetches done, 2) */
bool eof_reached; /* true if last fetch reached EOF */
+ bool run_async; /* true if run asynchronously */
+ bool async_waiting; /* true if requesting the parent to wait */
+ ForeignScanState *waiter; /* Next node to run a query among nodes
+ * sharing the same connection */
+ ForeignScanState *last_waiter; /* Last node in the waiting list;
+ * maintained only by the current
+ * owner of the connection */
/* working memory contexts */
MemoryContext batch_cxt; /* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
*/
typedef struct PgFdwModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
/* for remote query execution */
- PGconn *conn; /* connection for the scan */
char *p_name; /* name of prepared statement, if created */
/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
*/
typedef struct PgFdwDirectModifyState
{
+ PgFdwState s; /* common structure */
Relation rel; /* relcache entry for the foreign table */
AttInMetadata *attinmeta; /* attribute datatype conversion metadata */
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
static void postgresReScanForeignScan(ForeignScanState *node);
static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
static void postgresAddForeignUpdateTargets(Query *parsetree,
RangeTblEntry *target_rte,
Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
UpperRelationKind stage,
RelOptInfo *input_rel,
RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+ PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+ PendingAsyncRequest *areq,
+ bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+ PendingAsyncRequest *areq);
/*
* Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
static void close_cursor(PGconn *conn, unsigned int cursor_number);
static void prepare_foreign_modify(PgFdwModifyState *fmstate);
static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
routine->IterateForeignScan = postgresIterateForeignScan;
routine->ReScanForeignScan = postgresReScanForeignScan;
routine->EndForeignScan = postgresEndForeignScan;
+ routine->ShutdownForeignScan = postgresShutdownForeignScan;
/* Functions for updating foreign tables */
routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
/* Support functions for upper relation push-down */
routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
+ /* Support functions for async execution */
+ routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+ routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+ routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+ routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
PG_RETURN_POINTER(routine);
}
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- fsstate->conn = GetConnection(user, false);
+ fsstate->s.conn = GetConnection(user, false);
+ fsstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->waiter = NULL;
+ fsstate->last_waiter = node;
/* Assign a unique ID for my cursor */
- fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+ fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
fsstate->cursor_exists = false;
+ /* Initialize async execution status */
+ fsstate->run_async = false;
+ fsstate->async_waiting = false;
+
/* Get private info created by planner functions. */
fsstate->query = strVal(list_nth(fsplan->fdw_private,
FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
/*
- * If this is the first call after Begin or ReScan, we need to create the
- * cursor on the remote side.
- */
- if (!fsstate->cursor_exists)
- create_cursor(node);
-
- /*
* Get some more tuples, if we've run out.
*/
if (fsstate->next_tuple >= fsstate->num_tuples)
{
- /* No point in another fetch if we already detected EOF, though. */
- if (!fsstate->eof_reached)
- fetch_more_data(node);
- /* If we didn't get any tuples, must be end of data. */
+ ForeignScanState *next_conn_owner = node;
+
+ /* This node has sent a query on this connection */
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ /* Check if the result is available */
+ if (PQisBusy(fsstate->s.conn))
+ {
+ int rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_READABLE | WL_TIMEOUT,
+ PQsocket(fsstate->s.conn), 0,
+ WAIT_EVENT_ASYNC_WAIT);
+ if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+ {
+ /*
+ * This node is not ready yet. Tell the caller to wait.
+ */
+ fsstate->result_ready = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ Assert(fsstate->async_waiting);
+ fsstate->async_waiting = false;
+ fetch_received_data(node);
+
+ /*
+ * If other nodes are waiting on the same connection, let the
+ * first waiter be the next owner of this connection.
+ */
+ if (fsstate->waiter)
+ {
+ PgFdwScanState *next_owner_state;
+
+ next_conn_owner = fsstate->waiter;
+ next_owner_state = GetPgFdwScanState(next_conn_owner);
+ fsstate->waiter = NULL;
+
+ /*
+ * Only the current owner is responsible for maintaining the
+ * shortcut to the last waiter.
+ */
+ next_owner_state->last_waiter = fsstate->last_waiter;
+
+ /*
+ * For simplicity, last_waiter points to the node itself when no
+ * one is waiting for it.
+ */
+ fsstate->last_waiter = node;
+ }
+ }
+ else if (fsstate->s.connpriv->current_owner &&
+ !GetPgFdwScanState(node)->eof_reached)
+ {
+ /*
+ * Someone else is holding this connection and we want this node
+ * to run later. Add this node to the tail of the waiters' list
+ * and return not-ready. To avoid scanning through the waiters'
+ * list, the current owner maintains a shortcut to the last
+ * waiter.
+ */
+ PgFdwScanState *conn_owner_state =
+ GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+ ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+ PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+ last_waiter_state->waiter = node;
+ conn_owner_state->last_waiter = node;
+
+ /* Register the node in the async-waiting node list */
+ Assert(!GetPgFdwScanState(node)->async_waiting);
+
+ GetPgFdwScanState(node)->async_waiting = true;
+
+ fsstate->result_ready = fsstate->eof_reached;
+ return ExecClearTuple(slot);
+ }
+
+ /* At this time no node is running on the connection */
+ Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+ == NULL);
+ /*
+ * Send the next request for the next owner of this connection if
+ * needed.
+ */
+ if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+ {
+ PgFdwScanState *next_owner_state =
+ GetPgFdwScanState(next_conn_owner);
+
+ request_more_data(next_conn_owner);
+
+ /* Register the node in the async-waiting node list */
+ if (!next_owner_state->async_waiting)
+ next_owner_state->async_waiting = true;
+
+ if (!next_owner_state->run_async)
+ fetch_received_data(next_conn_owner);
+ }
+
+ /*
+ * If we haven't received a result for the given node this time,
+ * return with no tuple to give way to other nodes.
+ */
if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->result_ready = fsstate->eof_reached;
return ExecClearTuple(slot);
+ }
}
/*
* Return the next tuple.
*/
+ fsstate->result_ready = true;
ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
slot,
InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
static void
postgresReScanForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
char sql[64];
PGresult *res;
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
if (!fsstate->cursor_exists)
return;
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+
/*
* If any internal parameters affecting this node have changed, we'd
* better destroy and recreate the cursor. Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fsstate->conn, sql);
+ res = pgfdw_exec_query(fsstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
PQclear(res);
/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
static void
postgresEndForeignScan(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
/* Close the cursor if open, to prevent accumulation of cursors */
if (fsstate->cursor_exists)
- close_cursor(fsstate->conn, fsstate->cursor_number);
+ close_cursor(fsstate->s.conn, fsstate->cursor_number);
/* Release remote connection */
- ReleaseConnection(fsstate->conn);
- fsstate->conn = NULL;
+ ReleaseConnection(fsstate->s.conn);
+ fsstate->s.conn = NULL;
/* MemoryContexts will be deleted automatically. */
}
/*
+ * postgresShutdownForeignScan
+ * Clean up asynchronous state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+ ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+ if (plan->operation != CMD_SELECT)
+ return;
+
+ /* Absorb the remaining result */
+ absorb_current_result(node);
+}
+
+/*
* postgresAddForeignUpdateTargets
* Add resjunk column(s) needed for update/delete on a foreign table
*/
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
user = GetUserMapping(userid, table->serverid);
/* Open connection; report that we'll create a prepared statement. */
- fmstate->conn = GetConnection(user, true);
+ fmstate->s.conn = GetConnection(user, true);
+ fmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
fmstate->p_name = NULL; /* prepared statement not made yet */
/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
PGresult *res;
int n_rows;
+ vacate_connection((PgFdwState *)fmstate);
+
/* Set up the prepared statement on the remote server, if we didn't yet */
if (!fmstate->p_name)
prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
/*
* Execute the prepared statement.
*/
- if (!PQsendQueryPrepared(fmstate->conn,
+ if (!PQsendQueryPrepared(fmstate->s.conn,
fmstate->p_name,
fmstate->p_nums,
p_values,
NULL,
NULL,
0))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) !=
(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
/* Check number of rows affected, and fetch RETURNING tuple if any */
if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_exec_query(fmstate->conn, sql);
+ res = pgfdw_exec_query(fmstate->s.conn, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
PQclear(res);
fmstate->p_name = NULL;
}
/* Release remote connection */
- ReleaseConnection(fmstate->conn);
- fmstate->conn = NULL;
+ ReleaseConnection(fmstate->s.conn);
+ fmstate->s.conn = NULL;
}
/*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
* Get connection to the foreign server. Connection manager will
* establish new connection if necessary.
*/
- dmstate->conn = GetConnection(user, false);
+ dmstate->s.conn = GetConnection(user, false);
+ dmstate->s.connpriv = (PgFdwConnpriv *)
+ GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
/* Initialize state variable */
dmstate->num_tuples = -1; /* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
* If this is the first call after Begin, execute the statement.
*/
if (dmstate->num_tuples == -1)
+ {
+ vacate_connection((PgFdwState *)dmstate);
execute_dml_stmt(node);
+ }
/*
* If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
PQclear(dmstate->result);
/* Release remote connection */
- ReleaseConnection(dmstate->conn);
- dmstate->conn = NULL;
+ ReleaseConnection(dmstate->s.conn);
+ dmstate->s.conn = NULL;
/* MemoryContext will be deleted automatically. */
}
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
List *local_param_join_conds;
StringInfoData sql;
PGconn *conn;
+ PgFdwConnpriv *connpriv;
Selectivity local_sel;
QualCost local_cost;
List *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false);
+ connpriv = GetConnectionSpecificStorage(fpinfo->user,
+ sizeof(PgFdwConnpriv));
+ if (connpriv)
+ {
+ PgFdwState tmpstate;
+ tmpstate.conn = conn;
+ tmpstate.connpriv = connpriv;
+ vacate_connection(&tmpstate);
+ }
+
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
static void
create_cursor(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
ExprContext *econtext = node->ss.ps.ps_ExprContext;
int numParams = fsstate->numParams;
const char **values = fsstate->param_values;
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
StringInfoData buf;
PGresult *res;
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
* Fetch some more rows from the node's cursor.
*/
static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ PGconn *conn = fsstate->s.conn;
+ char sql[64];
+
+ /* The connection should be vacant */
+ Assert(fsstate->s.connpriv->current_owner == NULL);
+
+ /*
+ * If this is the first call after Begin or ReScan, we need to create the
+ * cursor on the remote side.
+ */
+ if (!fsstate->cursor_exists)
+ create_cursor(node);
+
+ snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+ fsstate->fetch_size, fsstate->cursor_number);
+
+ if (!PQsendQuery(conn, sql))
+ pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
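+ /*
+ * The query is now in flight; this node owns the connection until
+ * fetch_received_data() consumes the result.
+ */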
+ fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows from a FETCH previously sent by request_more_data.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
{
- PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
PGresult *volatile res = NULL;
MemoryContext oldcontext;
+ /* I should be the current connection owner */
+ Assert(fsstate->s.connpriv->current_owner == node);
+
/*
* We'll store the tuples in the batch_cxt. First, flush the previous
- * batch.
+ * batch if no tuples remain.
*/
- fsstate->tuples = NULL;
- MemoryContextReset(fsstate->batch_cxt);
+ if (fsstate->next_tuple >= fsstate->num_tuples)
+ {
+ fsstate->tuples = NULL;
+ fsstate->num_tuples = 0;
+ MemoryContextReset(fsstate->batch_cxt);
+ }
+ else if (fsstate->next_tuple > 0)
+ {
+ /* move the remaining tuples to the beginning of the store */
+ int n = 0;
+
+ while (fsstate->next_tuple < fsstate->num_tuples)
+ fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+ fsstate->num_tuples = n;
+ }
+
oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
/* PGresult must be released before leaving this function. */
PG_TRY();
{
- PGconn *conn = fsstate->conn;
+ PGconn *conn = fsstate->s.conn;
char sql[64];
- int numrows;
+ int addrows;
+ size_t newsize;
int i;
snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
fsstate->fetch_size, fsstate->cursor_number);
- res = pgfdw_exec_query(conn, sql);
+ res = pgfdw_get_result(conn, sql);
/* On error, report the original query, not the FETCH. */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
/* Convert the data into HeapTuples */
- numrows = PQntuples(res);
- fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
- fsstate->num_tuples = numrows;
- fsstate->next_tuple = 0;
+ addrows = PQntuples(res);
+ newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+ if (fsstate->tuples)
+ fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+ else
+ fsstate->tuples = (HeapTuple *) palloc(newsize);
- for (i = 0; i < numrows; i++)
+ for (i = 0; i < addrows; i++)
{
Assert(IsA(node->ss.ps.plan, ForeignScan));
- fsstate->tuples[i] =
+ fsstate->tuples[fsstate->num_tuples + i] =
make_tuple_from_result_row(res, i,
fsstate->rel,
fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
}
/* Update fetch_ct_2 */
- if (fsstate->fetch_ct_2 < 2)
+ if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
fsstate->fetch_ct_2++;
+ fsstate->next_tuple = 0;
+ fsstate->num_tuples += addrows;
+
/* Must be EOF if we didn't get as many tuples as we asked for. */
- fsstate->eof_reached = (numrows < fsstate->fetch_size);
+ fsstate->eof_reached = (addrows < fsstate->fetch_size);
PQclear(res);
res = NULL;
}
PG_CATCH();
{
+ fsstate->s.connpriv->current_owner = NULL;
if (res)
PQclear(res);
PG_RE_THROW();
}
PG_END_TRY();
+ fsstate->s.connpriv->current_owner = NULL;
+
MemoryContextSwitchTo(oldcontext);
}
/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+ PgFdwConnpriv *connpriv = fdwstate->connpriv;
+ ForeignScanState *owner;
+
+ if (connpriv == NULL || connpriv->current_owner == NULL)
+ return;
+
+ /*
+ * Let the current connection owner read the result of the running query.
+ */
+ owner = connpriv->current_owner;
+ fetch_received_data(owner);
+
+ /* Clear the waiting list */
+ while (owner)
+ {
+ PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+ fsstate->last_waiter = NULL;
+ owner = fsstate->waiter;
+ fsstate->waiter = NULL;
+ }
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+ ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+ if (owner)
+ {
+ PgFdwScanState *target_state = GetPgFdwScanState(owner);
+ PGconn *conn = target_state->s.conn;
+
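+ /* Discard the result of the in-flight query, blocking if necessary */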
+ while (PQisBusy(conn))
+ PQclear(PQgetResult(conn));
+ fsstate->s.connpriv->current_owner = NULL;
+ fsstate->async_waiting = false;
+ }
+}
+/*
* Force assorted GUC parameters to settings that ensure that we'll output
* data values in a form that is unambiguous to the remote server.
*
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
/* Construct name we'll use for the prepared statement. */
snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
- GetPrepStmtNumber(fmstate->conn));
+ GetPrepStmtNumber(fmstate->s.conn));
p_name = pstrdup(prep_name);
/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* the prepared statements we use in this module are simple enough that
* the remote server will make the right choices.
*/
- if (!PQsendPrepare(fmstate->conn,
+ if (!PQsendPrepare(fmstate->s.conn,
p_name,
fmstate->query,
0,
NULL))
- pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+ pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
/*
* Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- res = pgfdw_get_result(fmstate->conn, fmstate->query);
+ res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
- pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+ pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
PQclear(res);
/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
* the desired result. This allows us to avoid assuming that the remote
* server has the same OIDs we do for the parameters' types.
*/
- if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+ if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
NULL, values, NULL, NULL, 0))
- pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+ pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
/*
* Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
* We don't use a PG_TRY block here, so be careful not to throw error
* without releasing the PGresult.
*/
- dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+ dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
if (PQresultStatus(dmstate->result) !=
(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
- pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+ pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
dmstate->query);
/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
/* XXX Consider parameterized paths for the join relation */
}
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+ return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ GetPgFdwScanState(node)->run_async = true;
+ slot = ExecForeignScan(node);
+ if (GetPgFdwScanState(node)->result_ready)
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+ else
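+ /* No tuple was ready; wait for one socket event (no latch, no reset) */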
+ ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner and will wait instead.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+ bool reinit)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+ /* If the caller didn't reinit, this event is already in event set */
+ if (!reinit)
+ return true;
+
+ if (fsstate->s.connpriv->current_owner == node)
+ {
+ AddWaitEventToSet(estate->es_wait_event_set,
+ WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+ NULL, areq);
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+ ForeignScanState *node = (ForeignScanState *) areq->requestee;
+ TupleTableSlot *slot;
+
+ Assert(IsA(node, ForeignScanState));
+ slot = ExecForeignScan(node);
+ Assert(GetPgFdwScanState(node)->result_ready);
+
+ ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
/*
* Assess whether the aggregation, grouping and having operations can be pushed
* down to the foreign server. As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
PgFdwScanState *fdw_sstate;
Assert(fsstate);
- fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+ fdw_sstate = GetPgFdwScanState(fsstate);
tupdesc = fdw_sstate->tupdesc;
}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
UserMapping *user; /* only set in use_remote_estimate mode */
int fetch_size; /* fetch size for this remote table */
+ bool allow_prefetch; /* true to allow overlapped fetching */
/*
* Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
/* in connection.c */
extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
extern void ReleaseConnection(PGconn *conn);
extern unsigned int GetCursorNumber(PGconn *conn);
extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 5f65d9d..340a376 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
insert into bar2 values(7,77,77);
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
-- Check UPDATE with inherited target and an inherited source table
explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
delete from foo where f1 < 5 returning *;
delete from foo where f1 < 5 returning *;
explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
drop table foo cascade;
drop table bar cascade;
--
2.9.2
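To make the intended flow of the three new callbacks concrete, here is a minimal sketch of how a requestor might drive them. The driver loop and the "result" field are assumptions based on the patch, not code from it; "state"/ASYNCREQ_COMPLETE and es_wait_event_set do appear in the patch set, and WaitEventSetWait() is used with its 9.6-era signature.

/*
 * Minimal sketch, not code from the patch: one way a requestor could
 * drive the three FDW callbacks above.
 */
static TupleTableSlot *
drive_async_scan(EState *estate, PendingAsyncRequest *areq)
{
	/* Fire the request; it may complete immediately. */
	postgresForeignAsyncRequest(estate, areq);

	while (areq->state != ASYNCREQ_COMPLETE)
	{
		WaitEvent	ev;

		/* (Re)register the connection's socket if this node owns it. */
		postgresForeignAsyncConfigureWait(estate, areq, true);

		/* Sleep until the socket becomes readable ... */
		WaitEventSetWait(estate->es_wait_event_set, -1L, &ev, 1);

		/* ... then let the FDW consume the input and finish the tuple. */
		postgresForeignAsyncNotify(estate, (PendingAsyncRequest *) ev.user_data);
	}

	return (TupleTableSlot *) areq->result;
}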
0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 1fbebf72e4aa57bbb4d19616eabfe888c4063e29 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
ExecAppend.
ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing route. Asynchronous execution already adds
a lot of code, so this doesn't add significant degradation.
---
src/backend/executor/nodeAppend.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
TupleTableSlot *
ExecAppend(AppendState *node)
{
- if (node->as_nasyncplans > 0)
+ if (unlikely(node->as_nasyncplans > 0))
{
EState *estate = node->ps.state;
int i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
/*
* if we have async requests outstanding, run the event loop
*/
- if (node->as_nasyncpending > 0)
+ if (unlikely(node->as_nasyncpending > 0))
{
long timeout = node->as_syncdone ? -1 : 0;
--
2.9.2
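For reference, the hint macro being applied here is defined in PostgreSQL's c.h essentially as below (paraphrased from the header, not part of the patch), so the annotation costs nothing on compilers that lack __builtin_expect:

/* Branch-prediction hints, essentially as in src/include/c.h. */
#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif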
0005-Refactor-ExecAsyncEventLoop.patch (text/x-patch; charset=us-ascii)
From 0e811309902e99c40159f0984d1e1dccfd419861 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Jul 2017 17:19:10 +0900
Subject: [PATCH 5/5] Refactor ExecAsyncEventLoop
The compaction loop in ExecAsyncEventLoop was written in a somewhat
tricky way. This patch rewrites it in a more straightforward way. Maybe.
---
src/backend/executor/execAsync.c | 34 ++++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 115b147..173ee39 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -222,28 +222,38 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
/* If any node completed, compact the array. */
if (any_node_done)
{
- int hidx = 0,
- tidx;
+ int i = 0;
+ int npending = 0;
/*
* Swap all non-yet-completed items to the start of the array.
* Keep them in the same order.
*/
- for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+ /* Step 1: skip over not-completed elements at the beginning */
+ while (npending < estate->es_num_pending_async &&
+ estate->es_pending_async[npending]->state !=
+ ASYNCREQ_COMPLETE)
+ npending++;
+
+ /* Step 2: move forward not-completed elements hereafter */
+ for (i = npending + 1; i < estate->es_num_pending_async; ++i)
{
- PendingAsyncRequest *head;
- PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+ PendingAsyncRequest *tmp;
+ PendingAsyncRequest *curr = estate->es_pending_async[i];
- Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+ Assert(curr->state != ASYNCREQ_CALLBACK_PENDING);
- if (tail->state == ASYNCREQ_COMPLETE)
+ if (curr->state == ASYNCREQ_COMPLETE)
continue;
- head = estate->es_pending_async[hidx];
- estate->es_pending_async[tidx] = head;
- estate->es_pending_async[hidx] = tail;
- ++hidx;
+
+ tmp = estate->es_pending_async[npending];
+ estate->es_pending_async[npending] =
+ estate->es_pending_async[i];
+ estate->es_pending_async[i] = tmp;
+ ++npending;
}
- estate->es_num_pending_async = hidx;
+
+ estate->es_num_pending_async = npending;
}
/*
--
2.9.2
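What the refactored loop implements is a stable partition of the pending array. As a standalone sketch (the function and its arguments are illustrative; "state" and ASYNCREQ_COMPLETE come from the patch):

/*
 * Standalone sketch of the compaction in patch 0005: keep all
 * still-pending requests at the front of the array, preserving their
 * relative order; completed entries may end up shuffled at the back.
 */
static int
compact_pending(PendingAsyncRequest **reqs, int n)
{
	int			npending = 0;
	int			i;

	/* Step 1: the leading run of pending entries is already in place. */
	while (npending < n && reqs[npending]->state != ASYNCREQ_COMPLETE)
		npending++;

	/* Step 2: swap each later pending entry into the next front slot. */
	for (i = npending + 1; i < n; i++)
	{
		PendingAsyncRequest *tmp;

		if (reqs[i]->state == ASYNCREQ_COMPLETE)
			continue;

		tmp = reqs[npending];
		reqs[npending] = reqs[i];
		reqs[i] = tmp;
		npending++;
	}

	return npending;			/* the new es_num_pending_async */
}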
On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
[ new patches ]
I spent some time today refreshing my memory of what's going on with
this thread.
Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called. Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit. As far as it goes, that is good.
However, at a deeper level, I fear we haven't really solved the
problem. If an Append is directly on top of a ForeignScan node, then
this will work. But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable. Indeed, we'd really want all kinds of
joins and aggregates to be async-capable so that examples like the one
Corey asked about in
/messages/by-id/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=Sb4Q5rJsRA@mail.gmail.com
will work.
But if we do, then I fear we'll just be reintroducing the same
performance regression that we eliminated by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
it seems like that will either end up duplicating a lot of code from
the regular code path or, alternatively, polluting the regular code
path with some of the async code's concerns to avoid duplication, and
maybe slowing things down.
Maybe that concern is unjustified; I'm not sure. Thoughts?
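To make the duplication concern concrete, an async-aware intermediate node would need a second fetch path next to its normal one, along these lines (a hypothetical sketch: the async_capable flag and the ExecAsyncRequest() signature are made up for illustration):

/*
 * Hypothetical sketch of the concern above.  Only ExecProcNode() is
 * real; the flag and the ExecAsyncRequest() signature are invented.
 */
static TupleTableSlot *
fetch_from_child(PlanState *parent, PlanState *child, bool async_capable)
{
	if (async_capable)
	{
		/*
		 * Async path: post a request and return without a tuple.  The
		 * result arrives later through a response callback, so every
		 * caller up the tree must be able to suspend rather than block.
		 */
		ExecAsyncRequest(parent->state, parent, child);
		return NULL;
	}

	/* Sync path: the ordinary recursive call. */
	return ExecProcNode(child);
}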
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called. Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit. As far as it goes, that is good.
However, at a deeper level, I fear we haven't really solved the
problem. If an Append is directly on top of a ForeignScan node, then
this will work. But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable.
I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.
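Roughly, that scheme (in the shape it later took in the tree) turns ExecProcNode into an indirect call through the node, with a wrapper installed only when extra work is wanted; a sketch:

/*
 * Shape of the wrapper-injection scheme: the common case is a plain
 * indirect call, and instrumentation (for example) is a wrapper that
 * is installed only on nodes that need it.
 */
typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);

static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
	return node->ExecProcNode(node);	/* wrapper, or the real function */
}

static TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
	TupleTableSlot *result;

	InstrStartNode(node->instrument);
	result = node->ExecProcNodeReal(node);	/* the per-node function */
	InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
	return result;
}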
regards, tom lane
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.
Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for the comment.
At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
But if we do, then I fear we'll just be reintroducing the same
performance regression that we eliminated by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
I understand what Robert is concerned about, and I think I share the
same opinion. It needs a further, different framework.
At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de
The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.
Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.
Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how I can tie this work in with it.
Regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170728.173105.238045591.horiguchi.kyotaro@lab.ntt.co.jp>
Thank you for the comment.
At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>
regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
I understand what Robert is concerned about, and I share the same
opinion. It needs a further, different framework.
At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
...
Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.
Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how I can tie this work in with it.
I now understand the executor change that has just been made on
master based on the thread you pointed to. It seems to have the
capability to let an exec node become async-aware with no extra
cost on non-async processing. So it would be doable to (just)
*shrink* the current framework by detaching the async-aware side
of the API. But to get the most out of asynchrony, multiple
async-capable nodes distributed under async-unaware nodes must
run simultaneously.
There seem to be two ways to achieve this.
One is propagating a bitmap of the required async nodes up to the
topmost node and waiting for all the required nodes to become ready.
In the long run this requires all nodes to be async-aware, and that
would apparently hurt the performance of async-unaware nodes
containing async-capable nodes.
Another is getting rid of the recursive calls used to run an
execution tree. It is perhaps the same as what was mentioned as
"data-centric processing" in previous threads [1][2], but I'd like
to pay attention to the aspect of "enabling an execution tree to
resume from an arbitrary leaf node". So I'm considering realizing
it still in a one-tuple-at-a-time manner instead of collecting all
the tuples of a leaf node first, though I'm not sure it is doable.
[1]: /messages/by-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
[2]: /messages/by-id/20160629183254.frcm3dgg54ud5m6o@alap3.anarazel.de
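A minimal sketch of that second way, in which execution resumes at whichever leaf became ready and its tuple is pushed toward the root one parent at a time (every name here is an assumption; PlanState has no parent back-link today):

/*
 * Sketch only; all names are assumptions.  Resume at an arbitrary
 * ready leaf and push its tuple upward, one tuple at a time.
 */
static void
push_tuple_upward(PlanState *leaf, TupleTableSlot *slot)
{
	PlanState  *node = leaf;

	while (node->parent != NULL)		/* assumed parent back-link */
	{
		node = node->parent;

		/*
		 * Each node consumes the tuple and either emits one for its own
		 * parent or absorbs it (e.g. into a hash table) and waits for
		 * more input from below.
		 */
		slot = ExecPushTuple(node, slot);	/* assumed entry point */
		if (TupIsNull(slot))
			break;
	}
}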
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Another is getting rid of the recursive calls used to run an
execution tree.
That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here. Unfortunately, that's
probably another gigantic patch (that should probably be written by
Andres).
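As a toy picture of that flattening (not the actual ExprState interpreter): the tree is compiled into a linear array of steps run by a dispatch loop, which leaves a natural point to suspend and resume without unwinding a C call stack:

/*
 * Toy opcode interpreter in the style of b8d7f053c5c2: for example,
 * {STEP_CONST,1},{STEP_ADD,2},{STEP_DONE,0} evaluates 1 + 2.
 */
typedef enum { STEP_CONST, STEP_ADD, STEP_DONE } StepOp;
typedef struct { StepOp op; int value; } Step;

static int
run_steps(const Step *steps)
{
	int			acc = 0;
	int			i;

	for (i = 0;; i++)
	{
		switch (steps[i].op)
		{
			case STEP_CONST:
				acc = steps[i].value;	/* load a constant */
				break;
			case STEP_ADD:
				acc += steps[i].value;	/* accumulate */
				break;
			case STEP_DONE:
				return acc;				/* end of the flat program */
		}
	}
}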
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for the comment.
At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobbZrBPb7cvFj3ACPX2A_qSEB4ughRmB5dkGPXUYx_E+Q@mail.gmail.com>
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Another is getting rid of the recursive calls used to run an
execution tree.
That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here.
I read it in the source tree. The patch converts an expression
tree into an intermediate form and then runs it on a custom-made
interpreter. Guessing from Andres' phrase "upside down", the whole
thing will become source-driven.
Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).
Yeah, but building an async executor on the current style of
executor seems like futile work, and sitting idle until the patch
comes is also a waste of time. So I'm planning to include the
following stuff in the next PoC patch, even though I'm not sure it
can land on top of Andres' coming patch.
- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find it.) This should be
included in Andres' patch.
- Give the executor the ability to run from data-source (or driver)
nodes up to the root. I'm not sure this is included, but I suppose
he is aiming at this kind of thing.
- Rebuild asynchronous execution on the upside-down executor.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).
Yeah, but building an async executor on the current style of
executor seems like futile work, and sitting idle until the patch
comes is also a waste of time. So I'm planning to include the
following stuff in the next PoC patch, even though I'm not sure it
can land on top of Andres' coming patch.
- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find it.) This should be
included in Andres' patch.
- Give the executor the ability to run from data-source (or driver)
nodes up to the root. I'm not sure this is included, but I suppose
he is aiming at this kind of thing.
- Rebuild asynchronous execution on the upside-down executor.
Anyway, I modified ExecProcNode into a push-up form, and it *seems*
to work to some extent. But triggers and cursors are almost broken,
and several other regression tests fail. Some nodes, such as
WindowAgg, are terribly difficult to change to this push-up form
(using a state machine). And of course it is terribly inefficient.
I'm afraid all of this will turn out to be in vain. But anyway,
and FWIW, I'll post the work here after some cleanup on it.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170831.215236.135328985.horiguchi.kyotaro@lab.ntt.co.jp>
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>
Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).
Yeah, but building an async executor on the current style of
executor seems like futile work, and sitting idle until the patch
comes is also a waste of time. So I'm planning to include the
following stuff in the next PoC patch, even though I'm not sure it
can land on top of Andres' coming patch.
- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find it.) This should be
included in Andres' patch.
- Give the executor the ability to run from data-source (or driver)
nodes up to the root. I'm not sure this is included, but I suppose
he is aiming at this kind of thing.
- Rebuild asynchronous execution on the upside-down executor.
Anyway, I modified ExecProcNode into a push-up form, and it *seems*
to work to some extent. But triggers and cursors are almost broken,
and several other regression tests fail. Some nodes, such as
WindowAgg, are terribly difficult to change to this push-up form
(using a state machine). And of course it is terribly inefficient.
I'm afraid all of this will turn out to be in vain. But anyway,
and FWIW, I'll post the work here after some cleanup on it.
So, here it is. Maybe this is really a bad way to go. The worst
part is that it's terribly hard to maintain, because the behavior
of the state machine constructed in this patch is hardly
predictable, so it is easily broken. During the 'cleanup work' I
hit many crashes and infinite loops, and they were a bit hard to
diagnose. This will soon be broken by subsequent commits.
Anyway, and again FWIW, here it is. I'll leave this for a while
(at least for the period of this CF) and reconsider async in
different forms.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
poc_pushexecutor_20170904_4faa1dc.tar.bz2 (application/octet-stream)