a raft of parallelism-related bug fixes
My recent commit of the Gather executor node has made it relatively
simple to write code that does an end-to-end test of all of the
parallelism-related commits which have thus far gone into the tree.
Specifically, what I've done is hacked the planner to push a
single-copy Gather node on top of every plan that is thought to be
parallel-safe, and then run 'make check'. This turned up bugs in
nearly every parallelism-related commit that has thus far gone into
the tree, which is a little depressing, especially because some of
them are what we've taken to calling brown paper bag bugs. The good
news is that, with one or two exceptions, these are pretty much just
trivial oversights which are simple to fix, rather than any sort of
deeper design issue. Attached are 14 patches. Patches #1-#4 are
essential for testing purposes but are not proposed for commit,
although some of the code they contain may eventually become part of
other patches which are proposed for commit. Patches #5-#12 are
largely boring patches fixing fairly uninteresting mistakes; I propose
to commit these on an expedited basis. Patches #13-14 are also
proposed for commit but seem to me to be more in need of review. With
all of these patches, I can now get a clean 'make check' run, although
I think there are a few bugs remaining to be fixed because some of my
colleagues still experience misbehavior even with all of these patches
applied. The patch stack is also posted here; the branch is subject
to rebasing:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/gathertest
Here follows an overview of each individual patch (see also commit
messages within).
== For Testing Only ==
0001-Test-code.patch is the basic test code. In addition to pushing a
Gather node on top of apparently-safe parallel plans, it also ignores
that Gather node when generating EXPLAIN output and suppresses the
parallel context in error messages, changes which are essential to
getting the regression tests to pass. I'm wondering if the parallel
error context ought to be GUC-controlled, defaulting to on but capable
of being disabled on request.
0002-contain_parallel_unsafe-check_parallel_safety.patch and
0003-Temporary-hack-to-reduce-testing-failures.patch arrange NOT to
put Gather nodes on top of plans that contain parallel-restricted
operations or refer to temporary tables. Although such things can
exist in a parallel plan, they must be above every Gather node, not
beneath it. Here, the Gather node is being placed (strictly for
testing purposes) at the very top, so we must not insert it at all if
these things are present.
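As a purely illustrative example (the table is made up), with the test
hack in place a query such as the following must not get a Gather pushed
on top, because a temporary table is visible only to the leader:
CREATE TEMP TABLE scratch (x int);
SELECT * FROM scratch;  -- the test hack must skip the Gather here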
0004-Partial-group-locking-implementation.patch is a partial
implementation of group locking. I found that without this, the
regression tests hang frequently, and a clean run is impossible. This
patch doesn't modify the deadlock detector, and it doesn't take any
account of locks that should be mutually exclusive even as between
members of a parallel group, but it's enough for a clean regression
test run. We will need a full solution to this problem soon enough,
but right now I am only using this to find such unrelated bugs as we
may have.
== Proposed For Commit ==
0005-Don-t-send-protocol-messages-to-a-shm_mq-that-no-lon.patch fixes
a problem in the parallel worker shutdown sequence: a background
worker can choose to redirect to a shm_mq the messages that would
normally be sent to a client, and parallel workers always do this. But if the
worker generates a message after the DSM has been detached, it causes
a server crash.
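A rough reproduction sketch, not taken from the patch itself (the table
name is made up, and the Gather only shows up because of the test hack
described above): crank the log level up so that workers emit messages
during their shutdown sequence, then run any forced-parallel query:
SET log_min_messages = debug5;
SELECT count(*) FROM accounts;  -- a DEBUG message from the exiting worker could previously crash the server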
0006-Transfer-current-command-counter-ID-to-parallel-work.patch fixes
a problem in the code used to set up a parallel worker's transaction
state. The command counter is presently not copied to the worker.
This is awfully embarrassing and should have been caught in the
testing of the parallel mode/contexts patch, but I got overly focused
on the stuff stored inside TransactionStateData. Don't shoot.
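To illustrate why this matters (the table name is made up, and the
parallel plan only arises via the test hack), a worker that lacks the
leader's command counter uses the wrong snapshot curcid and so misses
rows written earlier in the same transaction:
BEGIN;
INSERT INTO accounts VALUES (1);
SELECT count(*) FROM accounts;  -- workers must see the row just inserted
COMMIT;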
0007-Tighten-up-application-of-parallel-mode-checks.patch fixes
another problem with the parallel mode checks, which are intended to
catch people doing unsafe things and throw errors instead of letting
them crash the server. Investigation revealed that they didn't have
this effect because parallel workers were running their pre-commit
sequence with the checks disabled. If they do something like try to
send notifications, it can lead to the worker getting an XID
assignment even though the master doesn't have one. That's really
bad, and crashes the server. That specific example should be
prohibited anyway (see patch #11) but even if we fix that I think this
is good tightening to prevent unpleasant surprises in the future.
0008-Invalidate-caches-after-cranking-up-a-parallel-worke.patch
invalidates system cache entries after cranking up a
parallel worker transaction. This is needed here for the same reason
that the logical decoding code needs to do it after time traveling:
otherwise, the worker might have leftover entries in its caches as a
result of the startup transaction that are now bogus given the changes
in what it can see.
0009-Fix-a-problem-with-parallel-workers-being-unable-to-.patch fixes
a problem with workers being unable to precisely recreate the
authorization state as it existed in the parallel leader. They need to
do that, or else it's a security vulnerability.
0010-Prohibit-parallel-query-when-the-isolation-level-is-.patch
prohibits parallel query at the serializable isolation level. This is
of course a restriction we'd rather not have, but it's a necessary one
for now because the predicate locking code doesn't understand the idea
of multiple processes with separate PGPROC structures being part of a
single transaction.
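The user-visible effect is roughly as follows (the table name is made
up): inside a serializable transaction the planner will not add a Gather,
and a previously-planned parallel query simply runs without workers:
BEGIN ISOLATION LEVEL SERIALIZABLE;
EXPLAIN SELECT count(*) FROM accounts;  -- no Gather node expected
COMMIT;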
0011-Mark-more-functions-parallel-restricted-or-parallel-.patch marks
as parallel-restricted or parallel-unsafe some functions that are in
fact one or the other but were not so marked by the commit that
introduced the new pg_proc flag. This includes functions for sending
notifications and a
few others.
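For anyone who wants to check how a particular function ended up marked,
this is just an ordinary catalog query against the new column, not part
of the patch itself:
SELECT proname, proparallel FROM pg_proc WHERE proname = 'pg_notify';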
0012-Rewrite-interaction-of-parallel-mode-with-parallel-e.patch
rejiggers the timing of enabling and disabling parallel mode when we
are attempting parallel execution. The old coding turned out to be
fragile in multiple ways. Since it's impractical to know at planning
time whether ExecutorRun will be called with a non-zero tuple count, this
patch instead observes whether or not this happens, and if it does
happen, the parallel plan is forced to run serially. In the old
coding, it instead just killed the parallel workers at the end of
ExecutorRun and therefore returned an incomplete result set. There
might be some further rejiggering that could be done here that would
be even better than this, but I'm fairly certain this is better than
what we've got in the tree right now.
0013-Modify-tqueue-infrastructure-to-support-transient-re.patch
attempts to address a deficiency in the tqueue.c/tqueue.h machinery I
recently introduced: backends can have ephemeral record types for
which they use backend-local typmods that may not be the same between
the leader and the worker. This patch has the worker send metadata
about the tuple descriptor for each such type, and the leader
registers the same tuple descriptor and then remaps the typmods from
the worker's typmod space to its own. This seems to work, but I'm a
little concerned that there may be cases it doesn't cover. Also,
there's room to question the overall approach. The only other
alternative that springs readily to mind is to try to arrange things
during the planning phase so that we never try to pass records between
parallel backends in this way, but that seems like it would be hard to
code (and thus likely to have bugs) and also pretty limiting.
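For a sense of the kind of query that exercises this path (illustrative
only), any expression that constructs an anonymous record gets a
backend-local typmod, which a worker assigns independently of the leader:
SELECT ROW(relname, relkind) FROM pg_class;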
0014-Fix-problems-with-ParamListInfo-serialization-mechan.patch, which
I just posted on the Parallel Seq Scan thread as a standalone patch,
fixes pretty much what the name of the file suggests. This actually
fixes two problems, one of which Noah spotted and commented on over on
that thread. By pure coincidence, the last 'make check' regression
failure I was still troubleshooting needed a fix for that issue plus a
fix to plpgsql_param_fetch. However, as I mentioned on the other
thread, I'm not quite sure which way to go with the change to
plpgsql_param_fetch so scrutiny of that point, in particular, would be
appreciated. See also
/messages/by-id/CA+TgmobN=wADVaUTwsH-xqvCdovkeRasuXw2k3R6vmpWig7raw@mail.gmail.com
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0004-Partial-group-locking-implementation.patch
From ea288aef684c084cafe649f3d8cfe1c928994770 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sat, 3 Oct 2015 13:34:35 -0400
Subject: [PATCH 04/14] Partial group locking implementation.
This doesn't touch deadlock.c but it's enough to get the regression
tests working with stuff pushed under Gather nodes.
---
src/backend/access/transam/parallel.c | 16 ++++
src/backend/storage/lmgr/lock.c | 123 ++++++++++++++++++++++++-----
src/backend/storage/lmgr/proc.c | 143 +++++++++++++++++++++++++++++++++-
src/include/storage/lock.h | 2 +-
src/include/storage/proc.h | 7 ++
5 files changed, 267 insertions(+), 24 deletions(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3041dab..90735df 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -386,6 +386,9 @@ LaunchParallelWorkers(ParallelContext *pcxt)
if (pcxt->nworkers == 0)
return;
+ /* We need to be a lock group leader. */
+ BecomeLockGroupLeader();
+
/* If we do have workers, we'd better have a DSM segment. */
Assert(pcxt->seg != NULL);
@@ -889,6 +892,19 @@ ParallelWorkerMain(Datum main_arg)
*/
/*
+ * Join locking group. We must do this before anything that could try
+ * to acquire a heavyweight lock, because any heavyweight locks acquired
+ * to this point could block either directly against the parallel group
+ * leader or against some process which in turn waits for a lock that
+ * conflicts with the parallel group leader, causing an undetected
+ * deadlock. (If we can't join the lock group, the leader has gone away,
+ * so just exit quietly.)
+ */
+ if (!BecomeLockGroupMember(fps->parallel_master_pgproc,
+ fps->parallel_master_pid))
+ return;
+
+ /*
* Load libraries that were loaded by original backend. We want to do
* this before restoring GUCs, because the libraries might define custom
* variables.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..de6a05e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -35,6 +35,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/twophase_rmgr.h"
+#include "access/xact.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -706,6 +707,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
lockMethodTable = LockMethods[lockmethodid];
if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes)
elog(ERROR, "unrecognized lock mode: %d", lockmode);
+ Assert(!IsInParallelMode() || MyProc->lockGroupLeader != NULL);
if (RecoveryInProgress() && !InRecovery &&
(locktag->locktag_type == LOCKTAG_OBJECT ||
@@ -1136,6 +1138,18 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
{
uint32 partition = LockHashPartition(hashcode);
+ /*
+ * It might seem unsafe to access proclock->groupLeader without a lock,
+ * but it's not really. Either we are initializing a proclock on our
+ * own behalf, in which case our group leader isn't changing because
+ * the group leader for a process can only ever be changed by the
+ * process itself; or else we are transferring a fast-path lock to the
+ * main lock table, in which case that process can't change its lock
+ * group leader without first releasing all of its locks (and in
+ * particular the one we are currently transferring).
+ */
+ proclock->groupLeader = proc->lockGroupLeader != NULL ?
+ proc->lockGroupLeader : proc;
proclock->holdMask = 0;
proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
@@ -1255,9 +1269,10 @@ RemoveLocalLock(LOCALLOCK *locallock)
* NOTES:
* Here's what makes this complicated: one process's locks don't
* conflict with one another, no matter what purpose they are held for
- * (eg, session and transaction locks do not conflict).
- * So, we must subtract off our own locks when determining whether the
- * requested new lock conflicts with those already held.
+ * (eg, session and transaction locks do not conflict). Nor do the locks
+ * of one process in a lock group conflict with those of another process in
+ * the same group. So, we must subtract off these locks when determining
+ * whether the requested new lock conflicts with those already held.
*/
int
LockCheckConflicts(LockMethod lockMethodTable,
@@ -1267,8 +1282,12 @@ LockCheckConflicts(LockMethod lockMethodTable,
{
int numLockModes = lockMethodTable->numLockModes;
LOCKMASK myLocks;
- LOCKMASK otherLocks;
+ int conflictMask = lockMethodTable->conflictTab[lockmode];
+ int conflictsRemaining[MAX_LOCKMODES];
+ int totalConflictsRemaining = 0;
int i;
+ SHM_QUEUE *procLocks;
+ PROCLOCK *otherproclock;
/*
* first check for global conflicts: If no locks conflict with my request,
@@ -1279,40 +1298,91 @@ LockCheckConflicts(LockMethod lockMethodTable,
* type of lock that conflicts with request. Bitwise compare tells if
* there is a conflict.
*/
- if (!(lockMethodTable->conflictTab[lockmode] & lock->grantMask))
+ if (!(conflictMask & lock->grantMask))
{
PROCLOCK_PRINT("LockCheckConflicts: no conflict", proclock);
return STATUS_OK;
}
/*
- * Rats. Something conflicts. But it could still be my own lock. We have
- * to construct a conflict mask that does not reflect our own locks, but
- * only lock types held by other processes.
+ * Rats. Something conflicts. But it could still be my own lock, or
+ * a lock held by another member of my locking group. First, figure out
+ * how many conflicts remain after subtracting out any locks I hold
+ * myself.
*/
myLocks = proclock->holdMask;
- otherLocks = 0;
for (i = 1; i <= numLockModes; i++)
{
- int myHolding = (myLocks & LOCKBIT_ON(i)) ? 1 : 0;
+ if ((conflictMask & LOCKBIT_ON(i)) == 0)
+ {
+ conflictsRemaining[i] = 0;
+ continue;
+ }
+ conflictsRemaining[i] = lock->granted[i];
+ if (myLocks & LOCKBIT_ON(i))
+ --conflictsRemaining[i];
+ totalConflictsRemaining += conflictsRemaining[i];
+ }
- if (lock->granted[i] > myHolding)
- otherLocks |= LOCKBIT_ON(i);
+ /* If no conflicts remain, we get the lock. */
+ if (totalConflictsRemaining == 0)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: resolved (simple)", proclock);
+ return STATUS_OK;
+ }
+
+ /* If no group locking, it's definitely a conflict. */
+ if (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)
+ {
+ Assert(proclock->tag.myProc == MyProc);
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return STATUS_FOUND;
}
/*
- * now check again for conflicts. 'otherLocks' describes the types of
- * locks held by other processes. If one of these conflicts with the kind
- * of lock that I want, there is a conflict and I have to sleep.
+ * Locks held in conflicting modes by members of our own lock group are
+ * not real conflicts; we can subtract those out and see if we still have
+ * a conflict. This is O(N) in the number of processes holding or awaiting
+ * locks on this object. We could improve that by making the shared memory
+ * state more complex (and larger) but it doesn't seem worth it.
*/
- if (!(lockMethodTable->conflictTab[lockmode] & otherLocks))
+ procLocks = &(lock->procLocks);
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+ while (otherproclock != NULL)
{
- /* no conflict. OK to get the lock */
- PROCLOCK_PRINT("LockCheckConflicts: resolved", proclock);
- return STATUS_OK;
+ if (proclock != otherproclock &&
+ proclock->groupLeader == otherproclock->groupLeader &&
+ (otherproclock->holdMask & conflictMask) != 0)
+ {
+ int intersectMask = otherproclock->holdMask & conflictMask;
+
+ for (i = 1; i <= numLockModes; i++)
+ {
+ if ((intersectMask & LOCKBIT_ON(i)) != 0)
+ {
+ if (conflictsRemaining[i] <= 0)
+ elog(PANIC, "proclocks held do not match lock");
+ conflictsRemaining[i]--;
+ totalConflictsRemaining--;
+ }
+ }
+
+ if (totalConflictsRemaining == 0)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: resolved (group)",
+ proclock);
+ return STATUS_OK;
+ }
+ }
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, &otherproclock->lockLink,
+ offsetof(PROCLOCK, lockLink));
}
- PROCLOCK_PRINT("LockCheckConflicts: conflicting", proclock);
+ /* Nope, it's a real conflict. */
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)", proclock);
return STATUS_FOUND;
}
@@ -3095,6 +3165,10 @@ PostPrepare_Locks(TransactionId xid)
PROCLOCKTAG proclocktag;
int partition;
+ /* Can't prepare a lock group follower. */
+ Assert(MyProc->lockGroupLeader == NULL ||
+ MyProc->lockGroupLeader == MyProc);
+
/* This is a critical section: any error means big trouble */
START_CRIT_SECTION();
@@ -3239,6 +3313,13 @@ PostPrepare_Locks(TransactionId xid)
proclocktag.myProc = newproc;
/*
+ * Update groupLeader pointer to point to the new proc. (We'd
+ * better not be a member of somebody else's lock group!)
+ */
+ Assert(proclock->groupLeader == proclock->tag.myProc);
+ proclock->groupLeader = newproc;
+
+ /*
* Update the proclock. We should not find any existing entry for
* the same hash key, since there can be only one entry for any
* given lock with my own proc.
@@ -3785,6 +3866,8 @@ lock_twophase_recover(TransactionId xid, uint16 info,
*/
if (!found)
{
+ Assert(proc->lockGroupLeader == NULL);
+ proclock->groupLeader = proc;
proclock->holdMask = 0;
proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2c2535b..2d55626 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -399,6 +399,11 @@ InitProcess(void)
MyProc->backendLatestXid = InvalidTransactionId;
pg_atomic_init_u32(&MyProc->nextClearXidElem, INVALID_PGPROCNO);
+ /* Check that group locking fields are in a proper initial state. */
+ Assert(MyProc->lockGroupLeaderIdentifier == 0);
+ Assert(MyProc->lockGroupLeader == NULL);
+ Assert(MyProc->lockGroupSize == 0);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
@@ -558,6 +563,11 @@ InitAuxiliaryProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();
+ /* Check that group locking fields are in a proper initial state. */
+ Assert(MyProc->lockGroupLeaderIdentifier == 0);
+ Assert(MyProc->lockGroupLeader == NULL);
+ Assert(MyProc->lockGroupSize == 0);
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -803,6 +813,33 @@ ProcKill(int code, Datum arg)
if (MyReplicationSlot != NULL)
ReplicationSlotRelease();
+ /* Detach from any lock group of which we are a member. */
+ if (MyProc->lockGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->lockGroupLeader;
+
+ LWLockAcquire(leader->backendLock, LW_EXCLUSIVE);
+ Assert(leader->lockGroupSize > 0);
+ if (--leader->lockGroupSize == 0)
+ {
+ leader->lockGroupLeaderIdentifier = 0;
+ leader->lockGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /* Leader exited first; return its PGPROC. */
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ else if (leader != MyProc)
+ MyProc->lockGroupLeader = NULL;
+ LWLockRelease(leader->backendLock);
+ }
+
/*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
@@ -817,9 +854,20 @@ ProcKill(int code, Datum arg)
procgloballist = proc->procgloballist;
SpinLockAcquire(ProcStructLock);
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * If we're still a member of a locking group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->lockGroupLeader == NULL)
+ {
+ /* Since lockGroupLeader is NULL, lockGroupSize should be 0. */
+ Assert(proc->lockGroupSize == 0);
+
+ /* Return PGPROC structure (and semaphore) to appropriate freelist */
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
/* Update shared estimate of spins_per_delay */
procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
@@ -952,9 +1000,31 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
bool allow_autovacuum_cancel = true;
int myWaitStatus;
PGPROC *proc;
+ PGPROC *leader = MyProc->lockGroupLeader;
int i;
/*
+ * If group locking is in use, locks held by members of my locking group
+ * need to be included in myHeldLocks.
+ */
+ if (leader != NULL)
+ {
+ SHM_QUEUE *procLocks = &(lock->procLocks);
+ PROCLOCK *otherproclock;
+
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+ while (otherproclock != NULL)
+ {
+ if (otherproclock->groupLeader == leader)
+ myHeldLocks |= otherproclock->holdMask;
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, &otherproclock->lockLink,
+ offsetof(PROCLOCK, lockLink));
+ }
+ }
+
+ /*
* Determine where to add myself in the wait queue.
*
* Normally I should go at the end of the queue. However, if I already
@@ -978,6 +1048,15 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
proc = (PGPROC *) waitQueue->links.next;
for (i = 0; i < waitQueue->size; i++)
{
+ /*
+ * If we're part of the same locking group as this waiter, its
+ * locks neither conflict with ours nor contribute to aheadRequests.
+ */
+ if (leader != NULL && leader == proc->lockGroupLeader)
+ {
+ proc = (PGPROC *) proc->links.next;
+ continue;
+ }
/* Must he wait for me? */
if (lockMethodTable->conflictTab[proc->waitLockMode] & myHeldLocks)
{
@@ -1671,3 +1750,61 @@ ProcSendSignal(int pid)
SetLatch(&proc->procLatch);
}
}
+
+/*
+ * BecomeLockGroupLeader - designate process as lock group leader
+ *
+ * Once this function has returned, other processes can join the lock group
+ * by calling BecomeLockGroupMember.
+ */
+void
+BecomeLockGroupLeader(void)
+{
+ /* If we already did it, we don't need to do it again. */
+ if (MyProc->lockGroupLeader == MyProc)
+ return;
+
+ /* We had better not be a follower. */
+ Assert(MyProc->lockGroupLeader == NULL);
+
+ /* Create single-member group, containing only ourselves. */
+ LWLockAcquire(MyProc->backendLock, LW_EXCLUSIVE);
+ MyProc->lockGroupLeader = MyProc;
+ MyProc->lockGroupLeaderIdentifier = MyProcPid;
+ MyProc->lockGroupSize = 1;
+ LWLockRelease(MyProc->backendLock);
+}
+
+/*
+ * BecomeLockGroupMember - designate process as lock group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended lock
+ * group, and false if not.
+ */
+bool
+BecomeLockGroupMember(PGPROC *leader, int pid)
+{
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* PID must be valid. */
+ Assert(pid != 0);
+
+ /* Try to join the group. */
+ LWLockAcquire(leader->backendLock, LW_EXCLUSIVE);
+ if (leader->lockGroupLeaderIdentifier == pid)
+ {
+ ok = true;
+ leader->lockGroupSize++;
+ MyProc->lockGroupLeader = leader;
+ }
+ LWLockRelease(leader->backendLock);
+
+ return ok;
+}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index a9cd08c..fa81003 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -346,6 +346,7 @@ typedef struct PROCLOCK
PROCLOCKTAG tag; /* unique identifier of proclock object */
/* data */
+ PGPROC *groupLeader; /* group leader, or NULL if no lock group */
LOCKMASK holdMask; /* bitmask for lock types currently held */
LOCKMASK releaseMask; /* bitmask for lock types to be released */
SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */
@@ -457,7 +458,6 @@ typedef enum
* worker */
} DeadLockState;
-
/*
* The lockmgr's shared hash tables are partitioned to reduce contention.
* To determine which partition a given locktag belongs to, compute the tag's
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3d68017..591e4ae 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -155,6 +155,10 @@ struct PGPROC
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
* lock */
+ /* Support for lock groups. */
+ int lockGroupLeaderIdentifier; /* MyProcPid, if I'm a leader */
+ PGPROC *lockGroupLeader; /* lock group leader, if I'm a follower */
+ int lockGroupSize; /* # of members, if I'm a leader */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -272,4 +276,7 @@ extern void LockErrorCleanup(void);
extern void ProcWaitForSignal(void);
extern void ProcSendSignal(int pid);
+extern void BecomeLockGroupLeader(void);
+extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+
#endif /* PROC_H */
--
2.3.8 (Apple Git-58)
0005-Don-t-send-protocol-messages-to-a-shm_mq-that-no-lon.patch
From dc9bc8dfe6af880343db930ceb6b13a67451b4e4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Oct 2015 13:04:10 -0400
Subject: [PATCH 05/14] Don't send protocol messages to a shm_mq that no longer
exists.
Commit 2bd9e412f92bc6a68f3e8bcb18e04955cc35001d introduced a mechanism
for relaying protocol messages from a background worker to another
backend via a shm_mq. However, there was no provision for shutting
down the communication channel. Therefore, a protocol message sent
late in the shutdown sequence, such as a DEBUG message resulting from
cranking up log_min_messages, could crash the server. To fix, install
an on_dsm_detach callback that disables sending messages to the shm_mq
when the associated DSM is detached.
---
src/backend/access/transam/parallel.c | 2 +-
src/backend/libpq/pqmq.c | 28 ++++++++++++++++++++++++++--
src/include/libpq/pqmq.h | 2 +-
3 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 90735df..3b87312 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -870,7 +870,7 @@ ParallelWorkerMain(Datum main_arg)
ParallelWorkerNumber * PARALLEL_ERROR_QUEUE_SIZE);
shm_mq_set_sender(mq, MyProc);
mqh = shm_mq_attach(mq, seg, NULL);
- pq_redirect_to_shm_mq(mq, mqh);
+ pq_redirect_to_shm_mq(seg, mqh);
pq_set_parallel_master(fps->parallel_master_pid,
fps->parallel_master_backend_id);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 9ca6b7c..0a3c2b7 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -26,6 +26,7 @@ static bool pq_mq_busy = false;
static pid_t pq_mq_parallel_master_pid = 0;
static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
+static void pq_cleanup_redirect_to_shm_mq(dsm_segment *seg, Datum arg);
static void mq_comm_reset(void);
static int mq_flush(void);
static int mq_flush_if_writable(void);
@@ -51,13 +52,26 @@ static PQcommMethods PqCommMqMethods = {
* message queue.
*/
void
-pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
+pq_redirect_to_shm_mq(dsm_segment *seg, shm_mq_handle *mqh)
{
PqCommMethods = &PqCommMqMethods;
- pq_mq = mq;
+ pq_mq = shm_mq_get_queue(mqh);
pq_mq_handle = mqh;
whereToSendOutput = DestRemote;
FrontendProtocol = PG_PROTOCOL_LATEST;
+ on_dsm_detach(seg, pq_cleanup_redirect_to_shm_mq, (Datum) 0);
+}
+
+/*
+ * When the DSM that contains our shm_mq goes away, we need to stop sending
+ * messages to it.
+ */
+static void
+pq_cleanup_redirect_to_shm_mq(dsm_segment *seg, Datum arg)
+{
+ pq_mq = NULL;
+ pq_mq_handle = NULL;
+ whereToSendOutput = DestNone;
}
/*
@@ -123,9 +137,19 @@ mq_putmessage(char msgtype, const char *s, size_t len)
if (pq_mq != NULL)
shm_mq_detach(pq_mq);
pq_mq = NULL;
+ pq_mq_handle = NULL;
return EOF;
}
+ /*
+ * If the message queue is already gone, just ignore the message. This
+ * doesn't necessarily indicate a problem; for example, DEBUG messages
+ * can be generated late in the shutdown sequence, after all DSMs have
+ * already been detached.
+ */
+ if (pq_mq == NULL)
+ return 0;
+
pq_mq_busy = true;
iov[0].data = &msgtype;
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index 9017565..97f17da 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -16,7 +16,7 @@
#include "lib/stringinfo.h"
#include "storage/shm_mq.h"
-extern void pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
+extern void pq_redirect_to_shm_mq(dsm_segment *seg, shm_mq_handle *mqh);
extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
--
2.3.8 (Apple Git-58)
0006-Transfer-current-command-counter-ID-to-parallel-work.patch
From 0dad69fa294e1704fc9c3e9a7c8c890c51b3fa33 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Oct 2015 18:09:02 -0400
Subject: [PATCH 06/14] Transfer current command counter ID to parallel
workers.
Commit 924bcf4f16d54c55310b28f77686608684734f42 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers. This can result in the workers seeing
a different set of tuples from the master, which is bad. Repair.
---
src/backend/access/transam/xact.c | 46 +++++++++++++++++++++------------------
1 file changed, 25 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e8aafba..3e24800 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4786,8 +4786,8 @@ Size
EstimateTransactionStateSpace(void)
{
TransactionState s;
- Size nxids = 5; /* iso level, deferrable, top & current XID,
- * XID count */
+ Size nxids = 6; /* iso level, deferrable, top & current XID,
+ * command counter, XID count */
for (s = CurrentTransactionState; s != NULL; s = s->parent)
{
@@ -4807,12 +4807,13 @@ EstimateTransactionStateSpace(void)
*
* We need to save and restore XactDeferrable, XactIsoLevel, and the XIDs
* associated with this transaction. The first eight bytes of the result
- * contain XactDeferrable and XactIsoLevel; the next eight bytes contain the
- * XID of the top-level transaction and the XID of the current transaction
- * (or, in each case, InvalidTransactionId if none). After that, the next 4
- * bytes contain a count of how many additional XIDs follow; this is followed
- * by all of those XIDs one after another. We emit the XIDs in sorted order
- * for the convenience of the receiving process.
+ * contain XactDeferrable and XactIsoLevel; the next twelve bytes contain the
+ * XID of the top-level transaction, the XID of the current transaction
+ * (or, in each case, InvalidTransactionId if none), and the current command
+ * counter. After that, the next 4 bytes contain a count of how many
+ * additional XIDs follow; this is followed by all of those XIDs one after
+ * another. We emit the XIDs in sorted order for the convenience of the
+ * receiving process.
*/
void
SerializeTransactionState(Size maxsize, char *start_address)
@@ -4820,14 +4821,16 @@ SerializeTransactionState(Size maxsize, char *start_address)
TransactionState s;
Size nxids = 0;
Size i = 0;
+ Size c = 0;
TransactionId *workspace;
TransactionId *result = (TransactionId *) start_address;
- Assert(maxsize >= 5 * sizeof(TransactionId));
- result[0] = (TransactionId) XactIsoLevel;
- result[1] = (TransactionId) XactDeferrable;
- result[2] = XactTopTransactionId;
- result[3] = CurrentTransactionState->transactionId;
+ result[c++] = (TransactionId) XactIsoLevel;
+ result[c++] = (TransactionId) XactDeferrable;
+ result[c++] = XactTopTransactionId;
+ result[c++] = CurrentTransactionState->transactionId;
+ result[c++] = (TransactionId) currentCommandId;
+ Assert(maxsize >= c * sizeof(TransactionId));
/*
* If we're running in a parallel worker and launching a parallel worker
@@ -4836,9 +4839,9 @@ SerializeTransactionState(Size maxsize, char *start_address)
*/
if (nParallelCurrentXids > 0)
{
- Assert(maxsize > (nParallelCurrentXids + 4) * sizeof(TransactionId));
- result[4] = nParallelCurrentXids;
- memcpy(&result[5], ParallelCurrentXids,
+ result[c++] = nParallelCurrentXids;
+ Assert(maxsize >= (nParallelCurrentXids + c) * sizeof(TransactionId));
+ memcpy(&result[c], ParallelCurrentXids,
nParallelCurrentXids * sizeof(TransactionId));
return;
}
@@ -4853,7 +4856,7 @@ SerializeTransactionState(Size maxsize, char *start_address)
nxids = add_size(nxids, 1);
nxids = add_size(nxids, s->nChildXids);
}
- Assert(nxids * sizeof(TransactionId) < maxsize);
+ Assert((c + 1 + nxids) * sizeof(TransactionId) <= maxsize);
/* Copy them to our scratch space. */
workspace = palloc(nxids * sizeof(TransactionId));
@@ -4871,8 +4874,8 @@ SerializeTransactionState(Size maxsize, char *start_address)
qsort(workspace, nxids, sizeof(TransactionId), xidComparator);
/* Copy data into output area. */
- result[4] = (TransactionId) nxids;
- memcpy(&result[5], workspace, nxids * sizeof(TransactionId));
+ result[c++] = (TransactionId) nxids;
+ memcpy(&result[c], workspace, nxids * sizeof(TransactionId));
}
/*
@@ -4892,8 +4895,9 @@ StartParallelWorkerTransaction(char *tstatespace)
XactDeferrable = (bool) tstate[1];
XactTopTransactionId = tstate[2];
CurrentTransactionState->transactionId = tstate[3];
- nParallelCurrentXids = (int) tstate[4];
- ParallelCurrentXids = &tstate[5];
+ currentCommandId = tstate[4];
+ nParallelCurrentXids = (int) tstate[5];
+ ParallelCurrentXids = &tstate[6];
CurrentTransactionState->blockState = TBLOCK_PARALLEL_INPROGRESS;
}
--
2.3.8 (Apple Git-58)
0007-Tighten-up-application-of-parallel-mode-checks.patch
From b7e8d5f88e5d8334ed7ef75d21f9b3599201b06f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 2 Oct 2015 19:12:18 -0400
Subject: [PATCH 07/14] Tighten up application of parallel mode checks.
Commit 924bcf4f16d54c55310b28f77686608684734f42 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot. Re-establish parallel mode during
parallel worker commit. Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.
Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
---
src/backend/access/transam/xact.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3e24800..47312f6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -497,7 +497,7 @@ AssignTransactionId(TransactionState s)
* Workers synchronize transaction state at the beginning of each parallel
* operation, so we can't account for new XIDs at this point.
*/
- if (IsInParallelMode())
+ if (IsInParallelMode() || IsParallelWorker())
elog(ERROR, "cannot assign XIDs during a parallel operation");
/*
@@ -931,7 +931,7 @@ CommandCounterIncrement(void)
* parallel operation, so we can't account for new commands after that
* point.
*/
- if (IsInParallelMode())
+ if (IsInParallelMode() || IsParallelWorker())
elog(ERROR, "cannot start commands during a parallel operation");
currentCommandId += 1;
@@ -1927,6 +1927,10 @@ CommitTransaction(void)
is_parallel_worker = (s->blockState == TBLOCK_PARALLEL_INPROGRESS);
+ /* Enforce parallel mode restrictions during parallel worker commit. */
+ if (is_parallel_worker)
+ EnterParallelMode();
+
ShowTransactionState("CommitTransaction");
/*
@@ -1971,10 +1975,7 @@ CommitTransaction(void)
/* If we might have parallel workers, clean them up now. */
if (IsInParallelMode())
- {
AtEOXact_Parallel(true);
- s->parallelModeLevel = 0;
- }
/* Shut down the deferred-trigger manager */
AfterTriggerEndXact(true);
@@ -2013,6 +2014,7 @@ CommitTransaction(void)
* commit processing
*/
s->state = TRANS_COMMIT;
+ s->parallelModeLevel = 0;
if (!is_parallel_worker)
{
--
2.3.8 (Apple Git-58)
0008-Invalidate-caches-after-cranking-up-a-parallel-worke.patch
From 2ef2d8d91d9cb455cf5b41b0c0f4ef273ac3fdd5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sat, 3 Oct 2015 17:45:38 -0400
Subject: [PATCH 08/14] Invalidate caches after cranking up a parallel worker
transaction.
Starting a parallel worker transaction changes our notion of which XIDs
are in-progress or committed, and our notion of the current command
counter ID. Therefore, our view of these caches prior to starting
this transaction may no longer be valid. Defend against that by clearing
them.
This fixes a bug in commit 924bcf4f16d54c55310b28f77686608684734f42.
---
src/backend/access/transam/parallel.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3b87312..a553dca 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -28,6 +28,7 @@
#include "tcop/tcopprot.h"
#include "utils/combocid.h"
#include "utils/guc.h"
+#include "utils/inval.h"
#include "utils/memutils.h"
#include "utils/resowner.h"
#include "utils/snapmgr.h"
@@ -944,6 +945,12 @@ ParallelWorkerMain(Datum main_arg)
Assert(asnapspace != NULL);
PushActiveSnapshot(RestoreSnapshot(asnapspace));
+ /*
+ * We've changed which tuples we can see, and must therefore invalidate
+ * system caches.
+ */
+ InvalidateSystemCaches();
+
/* Restore user ID and security context. */
SetUserIdAndSecContext(fps->current_user_id, fps->sec_context);
--
2.3.8 (Apple Git-58)
0009-Fix-a-problem-with-parallel-workers-being-unable-to-.patch
From 4ac3ae2e4773da358a73a7831d9fff2cb5f4a8cd Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Oct 2015 12:19:32 -0400
Subject: [PATCH 09/14] Fix a problem with parallel workers being unable to
restore role.
check_role() tries to verify that the user has permission to become the
requested role, but this is inappropriate in a parallel worker, which
needs to exactly recreate the master's authorization settings. So skip
the check in that case.
This fixes a bug in commit 924bcf4f16d54c55310b28f77686608684734f42.
---
src/backend/access/transam/parallel.c | 7 +++++++
src/backend/commands/variable.c | 8 ++++++--
src/include/access/parallel.h | 1 +
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index a553dca..3c92a28 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -96,6 +96,9 @@ int ParallelWorkerNumber = -1;
/* Is there a parallel message pending which we need to receive? */
bool ParallelMessagePending = false;
+/* Are we initializing a parallel worker? */
+bool InitializingParallelWorker = false;
+
/* Pointer to our fixed parallel state. */
static FixedParallelState *MyFixedParallelState;
@@ -818,6 +821,9 @@ ParallelWorkerMain(Datum main_arg)
char *tstatespace;
StringInfoData msgbuf;
+ /* Set flag to indicate that we're initializing a parallel worker. */
+ InitializingParallelWorker = true;
+
/* Establish signal handlers. */
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();
@@ -958,6 +964,7 @@ ParallelWorkerMain(Datum main_arg)
* We've initialized all of our state now; nothing should change
* hereafter.
*/
+ InitializingParallelWorker = false;
EnterParallelMode();
/*
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 2d0a44e..16c122a 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -19,6 +19,7 @@
#include <ctype.h>
#include "access/htup_details.h"
+#include "access/parallel.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/pg_authid.h"
@@ -877,9 +878,12 @@ check_role(char **newval, void **extra, GucSource source)
ReleaseSysCache(roleTup);
/*
- * Verify that session user is allowed to become this role
+ * Verify that session user is allowed to become this role, but
+ * skip this in parallel mode, where we must blindly recreate the
+ * parallel leader's state.
*/
- if (!is_member_of_role(GetSessionUserId(), roleid))
+ if (!InitializingParallelWorker &&
+ !is_member_of_role(GetSessionUserId(), roleid))
{
GUC_check_errcode(ERRCODE_INSUFFICIENT_PRIVILEGE);
GUC_check_errmsg("permission denied to set role \"%s\"",
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index b029c1e..44f0616 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -48,6 +48,7 @@ typedef struct ParallelContext
extern bool ParallelMessagePending;
extern int ParallelWorkerNumber;
+extern bool InitializingParallelWorker;
#define IsParallelWorker() (ParallelWorkerNumber >= 0)
--
2.3.8 (Apple Git-58)
0010-Prohibit-parallel-query-when-the-isolation-level-is-.patch
From 0c97636613509f289b3699e25af2c6c5b80e90ad Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sun, 4 Oct 2015 01:11:20 -0400
Subject: [PATCH 10/14] Prohibit parallel query when the isolation level is
serializable.
In order for this to be safe, the code which handles true serializability
will need to be taught that the SIRead locks taken by a parallel worker
pertain to the same transaction as those taken by the parallel leader.
Some further changes may be needed as well. Until the necessary
adaptations are made, don't generate parallel plans in serializable
mode, and if a previously-generated parallel plan is used after
serializable mode has been activated, run it serially.
This fixes a bug in commit 7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b.
---
src/backend/access/transam/parallel.c | 8 ++++++++
src/backend/optimizer/plan/planner.c | 10 ++++++++++
2 files changed, 18 insertions(+)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3c92a28..edbbf9e 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -135,6 +135,14 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
if (dynamic_shared_memory_type == DSM_IMPL_NONE)
nworkers = 0;
+ /*
+ * If we are running under serializable isolation, we can't use
+ * parallel workers, at least not until somebody enhances that mechanism
+ * to be parallel-aware.
+ */
+ if (IsolationIsSerializable())
+ nworkers = 0;
+
/* We might be running in a short-lived memory context. */
oldcontext = MemoryContextSwitchTo(TopTransactionContext);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cec2904..4a9828a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -20,6 +20,7 @@
#include "access/htup_details.h"
#include "access/parallel.h"
+#include "access/xact.h"
#include "executor/executor.h"
#include "executor/nodeAgg.h"
#include "foreign/fdwapi.h"
@@ -245,11 +246,20 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
* a parallel worker. We might eventually be able to relax this
* restriction, but for now it seems best not to have parallel workers
* trying to create their own parallel workers.
+ *
+ * We can't use parallelism in serializable mode because the predicate
+ * locking code is not parallel-aware. It's not catastrophic if someone
+ * tries to run a parallel plan in serializable mode; it just won't get
+ * any workers and will run serially. But it seems like a good heuristic
+ * to assume that the same serialization level will be in effect at plan
+ * time and execution time, so don't generate a parallel plan if we're
+ * in serializable mode.
*/
glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
parse->utilityStmt == NULL && !IsParallelWorker() &&
+ !IsolationIsSerializable() &&
!check_parallel_safety((Node *) parse, false);
/*
--
2.3.8 (Apple Git-58)
0001-Test-code.patch
From 8540a95c8013a07cd175bab7a8d971663a9a6d09 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Sep 2015 18:35:40 -0400
Subject: [PATCH 01/14] Test code.
---
src/backend/access/transam/parallel.c | 2 +
src/backend/commands/explain.c | 12 ++++-
src/backend/optimizer/plan/planner.c | 87 +++++++++++++++++++++++++++++++++++
3 files changed, 100 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 29d6ed5..3041dab 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -993,7 +993,9 @@ ParallelExtensionTrampoline(dsm_segment *seg, shm_toc *toc)
static void
ParallelErrorContext(void *arg)
{
+#if 0
errcontext("parallel worker, pid %d", *(int32 *) arg);
+#endif
}
/*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..8612430 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -571,6 +571,7 @@ void
ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
{
Bitmapset *rels_used = NULL;
+ PlanState *ps;
Assert(queryDesc->plannedstmt != NULL);
es->pstmt = queryDesc->plannedstmt;
@@ -579,7 +580,16 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
es->rtable_names = select_rtable_names_for_explain(es->rtable, rels_used);
es->deparse_cxt = deparse_context_for_plan_rtable(es->rtable,
es->rtable_names);
- ExplainNode(queryDesc->planstate, NIL, NULL, NULL, es);
+ /*
+ * XXX. Just for testing purposes, suppress the display of a toplevel
+ * gather node, so that we can run the regression tests with Gather
+ * nodes forcibly inserted without getting test failures due to different
+ * EXPLAIN output.
+ */
+ ps = queryDesc->planstate;
+ if (IsA(ps, GatherState))
+ ps = outerPlanState(ps);
+ ExplainNode(ps, NIL, NULL, NULL, es);
}
/*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1ee67c..76ad8b3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -47,6 +47,7 @@
#include "storage/dsm_impl.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/syscache.h"
/* GUC parameter */
@@ -160,6 +161,40 @@ planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
return result;
}
+/* This code is crap, just for testing. Don't confuse it with good code. */
+static bool
+rte_check_safety(RangeTblEntry *rte)
+{
+ HeapTuple tp;
+ Form_pg_class reltup;
+ bool retval;
+ ListCell *lc;
+
+ switch (rte->rtekind)
+ {
+ case RTE_RELATION:
+ tp = SearchSysCache1(RELOID, ObjectIdGetDatum(rte->relid));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for relation %u",
+ rte->relid);
+ reltup = (Form_pg_class) GETSTRUCT(tp);
+ retval = (reltup->relpersistence != RELPERSISTENCE_TEMP);
+ ReleaseSysCache(tp);
+ return retval;
+
+ case RTE_SUBQUERY:
+ foreach (lc, rte->subquery->rtable)
+ {
+ RangeTblEntry *rte2 = lfirst(lc);
+ if (!rte_check_safety(rte2))
+ return false;
+ }
+
+ default:
+ return true;
+ }
+}
+
PlannedStmt *
standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
@@ -284,6 +319,58 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
top_plan = materialize_finished_plan(top_plan);
}
+ /* XXX: Force a gather plan for testing purposes. */
+ if (glob->parallelModeOK)
+ {
+ bool use_gather = true;
+
+ /* We don't copy subplans to workers. */
+ if (glob->subplans != NIL)
+ use_gather = false;
+
+ /* Parallel mode doesn't currently support temporary tables. */
+ if (use_gather)
+ {
+ ListCell *lc;
+ ListCell *l;
+
+ foreach(lc, root->parse->rtable)
+ {
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+ if (!rte_check_safety(rte))
+ use_gather = false;
+ }
+
+ foreach(l, glob->subroots)
+ {
+ PlannerInfo *subroot = (PlannerInfo *) lfirst(l);
+
+ foreach(lc, subroot->parse->rtable)
+ {
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+
+ if (!rte_check_safety(rte))
+ use_gather = false;
+ }
+ }
+ }
+
+ /* No disqualifying conditions? Then do it! */
+ if (use_gather)
+ {
+ Gather *gather = makeNode(Gather);
+
+ gather->plan.targetlist = top_plan->targetlist;
+ gather->plan.qual = NIL;
+ gather->plan.lefttree = top_plan;
+ gather->plan.righttree = NULL;
+ gather->num_workers = 1;
+ gather->single_copy = true;
+ root->glob->parallelModeNeeded = true;
+ top_plan = &gather->plan;
+ }
+ }
+
/*
* If any Params were generated, run through the plan tree and compute
* each plan node's extParam/allParam sets. Ideally we'd merge this into
--
2.3.8 (Apple Git-58)
0002-contain_parallel_unsafe-check_parallel_safety.patch
From 601eef8550656be860699915b80dd01921650ad4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 2 Oct 2015 23:57:46 -0400
Subject: [PATCH 02/14] contain_parallel_unsafe -> check_parallel_safety.
enhance check_parallel_safety to detect use of temporary type ids.
---
src/backend/optimizer/plan/planner.c | 2 +-
src/backend/optimizer/util/clauses.c | 75 ++++++++++++++++++++++++++++--------
src/backend/utils/cache/lsyscache.c | 22 +++++++++++
src/include/optimizer/clauses.h | 2 +-
src/include/utils/lsyscache.h | 1 +
5 files changed, 85 insertions(+), 17 deletions(-)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 76ad8b3..c502377 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -250,7 +250,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
parse->utilityStmt == NULL && !IsParallelWorker() &&
- !contain_parallel_unsafe((Node *) parse);
+ !check_parallel_safety((Node *) parse, true);
/*
* glob->parallelModeOK should tell us whether it's necessary to impose
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..f4d8f98 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -21,6 +21,7 @@
#include "access/htup_details.h"
#include "catalog/pg_aggregate.h"
+#include "catalog/pg_class.h"
#include "catalog/pg_language.h"
#include "catalog/pg_operator.h"
#include "catalog/pg_proc.h"
@@ -87,6 +88,11 @@ typedef struct
char *prosrc;
} inline_error_callback_arg;
+typedef struct
+{
+ bool allow_restricted;
+} check_parallel_safety_arg;
+
static bool contain_agg_clause_walker(Node *node, void *context);
static bool count_agg_clauses_walker(Node *node,
count_agg_clauses_context *context);
@@ -96,7 +102,11 @@ static bool contain_subplans_walker(Node *node, void *context);
static bool contain_mutable_functions_walker(Node *node, void *context);
static bool contain_volatile_functions_walker(Node *node, void *context);
static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+ check_parallel_safety_arg *context);
+static bool parallel_too_dangerous(char proparallel,
+ check_parallel_safety_arg *context);
+static bool typeid_is_temp(Oid typeid);
static bool contain_nonstrict_functions_walker(Node *node, void *context);
static bool contain_leaked_vars_walker(Node *node, void *context);
static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1214,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
*****************************************************************************/
bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
{
- return contain_parallel_unsafe_walker(node, NULL);
+ check_parallel_safety_arg context;
+
+ context.allow_restricted = allow_restricted;
+ return check_parallel_safety_walker(node, &context);
}
static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg *context)
{
if (node == NULL)
return false;
@@ -1218,7 +1231,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
{
FuncExpr *expr = (FuncExpr *) node;
- if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(expr->funcid), context))
return true;
/* else fall through to check args */
}
@@ -1227,7 +1240,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
OpExpr *expr = (OpExpr *) node;
set_opfuncid(expr);
- if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
return true;
/* else fall through to check args */
}
@@ -1236,7 +1249,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
DistinctExpr *expr = (DistinctExpr *) node;
set_opfuncid((OpExpr *) expr); /* rely on struct equivalence */
- if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
return true;
/* else fall through to check args */
}
@@ -1245,7 +1258,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
NullIfExpr *expr = (NullIfExpr *) node;
set_opfuncid((OpExpr *) expr); /* rely on struct equivalence */
- if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
return true;
/* else fall through to check args */
}
@@ -1254,7 +1267,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
set_sa_opfuncid(expr);
- if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
return true;
/* else fall through to check args */
}
@@ -1268,12 +1281,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
/* check the result type's input function */
getTypeInputInfo(expr->resulttype,
&iofunc, &typioparam);
- if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(iofunc), context))
return true;
/* check the input type's output function */
getTypeOutputInfo(exprType((Node *) expr->arg),
&iofunc, &typisvarlena);
- if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(func_parallel(iofunc), context))
return true;
/* else fall through to check args */
}
@@ -1282,7 +1295,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
if (OidIsValid(expr->elemfuncid) &&
- func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+ parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
return true;
/* else fall through to check args */
}
@@ -1294,11 +1307,23 @@ contain_parallel_unsafe_walker(Node *node, void *context)
foreach(opid, rcexpr->opnos)
{
- if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+ if (parallel_too_dangerous(op_volatile(lfirst_oid(opid)), context))
return true;
}
/* else fall through to check args */
}
+ else if (IsA(node, RowExpr))
+ {
+ RowExpr *rexpr = (RowExpr *) node;
+ if (!context->allow_restricted && typeid_is_temp(rexpr->row_typeid))
+ return true;
+ }
+ else if (IsA(node, ArrayExpr))
+ {
+ ArrayExpr *aexpr = (ArrayExpr *) node;
+ if (!context->allow_restricted && typeid_is_temp(aexpr->array_typeid))
+ return true;
+ }
else if (IsA(node, Query))
{
Query *query = (Query *) node;
@@ -1308,14 +1333,34 @@ contain_parallel_unsafe_walker(Node *node, void *context)
/* Recurse into subselects */
return query_tree_walker(query,
- contain_parallel_unsafe_walker,
+ check_parallel_safety_walker,
context, 0);
}
return expression_tree_walker(node,
- contain_parallel_unsafe_walker,
+ check_parallel_safety_walker,
context);
}
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg *context)
+{
+ if (context->allow_restricted)
+ return proparallel == PROPARALLEL_UNSAFE;
+ else
+ return proparallel != PROPARALLEL_SAFE;
+}
+
+static bool
+typeid_is_temp(Oid typeid)
+{
+ Oid relid = get_typ_typrelid(typeid);
+
+ if (!OidIsValid(relid))
+ return false;
+
+ return (get_rel_persistence(relid) == RELPERSISTENCE_TEMP);
+}
+
/*****************************************************************************
* Check clauses for nonstrict functions
*****************************************************************************/
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 8d1cdf1..093da76 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -1787,6 +1787,28 @@ get_rel_tablespace(Oid relid)
return InvalidOid;
}
+/*
+ * get_rel_persistence
+ *
+ * Returns the relpersistence associated with a given relation.
+ */
+char
+get_rel_persistence(Oid relid)
+{
+ HeapTuple tp;
+ Form_pg_class reltup;
+ char result;
+
+ tp = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for relation %u", relid);
+ reltup = (Form_pg_class) GETSTRUCT(tp);
+ result = reltup->relpersistence;
+ ReleaseSysCache(tp);
+
+ return result;
+}
+
/* ---------- TRANSFORM CACHE ---------- */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..81a4b8f 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,7 @@ extern bool contain_subplans(Node *clause);
extern bool contain_mutable_functions(Node *clause);
extern bool contain_volatile_functions(Node *clause);
extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
extern bool contain_nonstrict_functions(Node *clause);
extern bool contain_leaked_vars(Node *clause);
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 450d9fe..dcc421f 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -103,6 +103,7 @@ extern Oid get_rel_namespace(Oid relid);
extern Oid get_rel_type_id(Oid relid);
extern char get_rel_relkind(Oid relid);
extern Oid get_rel_tablespace(Oid relid);
+extern char get_rel_persistence(Oid relid);
extern Oid get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
extern Oid get_transform_tosql(Oid typid, Oid langid, List *trftypes);
extern bool get_typisdefined(Oid typid);
--
2.3.8 (Apple Git-58)
Attachment: 0003-Temporary-hack-to-reduce-testing-failures.patch (application/x-patch)
From 7f47db8fd1f82e7000893cc227998f7f99a41b41 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sat, 3 Oct 2015 00:11:45 -0400
Subject: [PATCH 03/14] Temporary hack to reduce testing failures.
---
src/backend/optimizer/plan/planner.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c502377..cec2904 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -250,7 +250,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
parse->utilityStmt == NULL && !IsParallelWorker() &&
- !check_parallel_safety((Node *) parse, true);
+ !check_parallel_safety((Node *) parse, false);
/*
* glob->parallelModeOK should tell us whether it's necessary to impose
--
2.3.8 (Apple Git-58)
Attachment: 0011-Mark-more-functions-parallel-restricted-or-parallel-.patch (application/x-patch)
From ff483195182e1b6f0bebf04c2c897154941296ab Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 2 Oct 2015 20:04:31 -0400
Subject: [PATCH 11/14] Mark more functions parallel-restricted or
parallel-unsafe.
Commit 7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b was overoptimistic
about the degree of safety associated with running various functions
in parallel mode. Functions that take a table name or OID as an
argument are at least parallel-restricted, because the table might be
temporary, and we currently don't allow parallel workers to touch
temporary tables. Functions that take a query as an argument are
outright unsafe, because the query could be anything, including a
parallel-unsafe query.
Also, the queue of pending notifications is backend-private, so adding
to it from a worker doesn't behave correctly. We could fix this by
transferring the worker's queue of pending notifications to the master
during worker cleanup, but that seems like more trouble than it's
worth for now.
---
src/backend/commands/async.c | 3 +++
src/include/catalog/pg_proc.h | 40 ++++++++++++++++++++--------------------
2 files changed, 23 insertions(+), 20 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f2b9a74..3657d69 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -544,6 +544,9 @@ Async_Notify(const char *channel, const char *payload)
Notification *n;
MemoryContext oldcontext;
+ if (IsInParallelMode())
+ elog(ERROR, "cannot send notifications during a parallel operation");
+
if (Trace_notify)
elog(DEBUG1, "Async_Notify(%s)", channel);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index eb55b3a..f688454 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2032,9 +2032,9 @@ DATA(insert OID = 1639 ( oidge PGNSP PGUID 12 1 0 0 0 f f f t t f i s 2 0
/* System-view support functions */
DATA(insert OID = 1573 ( pg_get_ruledef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_get_ruledef _null_ _null_ _null_ ));
DESCR("source text of a rule");
-DATA(insert OID = 1640 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_name _null_ _null_ _null_ ));
+DATA(insert OID = 1640 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_name _null_ _null_ _null_ ));
DESCR("select statement of a view");
-DATA(insert OID = 1641 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef _null_ _null_ _null_ ));
+DATA(insert OID = 1641 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef _null_ _null_ _null_ ));
DESCR("select statement of a view");
DATA(insert OID = 1642 ( pg_get_userbyid PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 19 "26" _null_ _null_ _null_ _null_ _null_ pg_get_userbyid _null_ _null_ _null_ ));
DESCR("role name by OID (with fallback)");
@@ -4036,11 +4036,11 @@ DESCR("I/O");
/* System-view support functions with pretty-print option */
DATA(insert OID = 2504 ( pg_get_ruledef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_ruledef_ext _null_ _null_ _null_ ));
DESCR("source text of a rule with pretty-print option");
-DATA(insert OID = 2505 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "25 16" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_name_ext _null_ _null_ _null_ ));
+DATA(insert OID = 2505 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s r 2 0 25 "25 16" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_name_ext _null_ _null_ _null_ ));
DESCR("select statement of a view with pretty-print option");
-DATA(insert OID = 2506 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_ext _null_ _null_ _null_ ));
+DATA(insert OID = 2506 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s r 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_ext _null_ _null_ _null_ ));
DESCR("select statement of a view with pretty-print option");
-DATA(insert OID = 3159 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 23" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_wrap _null_ _null_ _null_ ));
+DATA(insert OID = 3159 ( pg_get_viewdef PGNSP PGUID 12 1 0 0 0 f f f f t f s r 2 0 25 "26 23" _null_ _null_ _null_ _null_ _null_ pg_get_viewdef_wrap _null_ _null_ _null_ ));
DESCR("select statement of a view with pretty-printing and specified line wrapping");
DATA(insert OID = 2507 ( pg_get_indexdef PGNSP PGUID 12 1 0 0 0 f f f f t f s s 3 0 25 "26 23 16" _null_ _null_ _null_ _null_ _null_ pg_get_indexdef_ext _null_ _null_ _null_ ));
DESCR("index description (full create statement or single expression) with pretty-print option");
@@ -4062,7 +4062,7 @@ DESCR("trigger description with pretty-print option");
/* asynchronous notifications */
DATA(insert OID = 3035 ( pg_listening_channels PGNSP PGUID 12 1 10 0 0 f f f f t t s r 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pg_listening_channels _null_ _null_ _null_ ));
DESCR("get the channels that the current backend listens to");
-DATA(insert OID = 3036 ( pg_notify PGNSP PGUID 12 1 0 0 0 f f f f f f v s 2 0 2278 "25 25" _null_ _null_ _null_ _null_ _null_ pg_notify _null_ _null_ _null_ ));
+DATA(insert OID = 3036 ( pg_notify PGNSP PGUID 12 1 0 0 0 f f f f f f v r 2 0 2278 "25 25" _null_ _null_ _null_ _null_ _null_ pg_notify _null_ _null_ _null_ ));
DESCR("send a notification event");
DATA(insert OID = 3296 ( pg_notification_queue_usage PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 701 "" _null_ _null_ _null_ _null_ _null_ pg_notification_queue_usage _null_ _null_ _null_ ));
DESCR("get the fraction of the asynchronous notification queue currently in use");
@@ -4327,35 +4327,35 @@ DESCR("concatenate XML values");
DATA(insert OID = 2922 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 25 "142" _null_ _null_ _null_ _null_ _null_ xmltotext _null_ _null_ _null_ ));
DESCR("serialize an XML value to a character string");
-DATA(insert OID = 2923 ( table_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xml _null_ _null_ _null_ ));
+DATA(insert OID = 2923 ( table_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xml _null_ _null_ _null_ ));
DESCR("map table contents to XML");
-DATA(insert OID = 2924 ( query_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xml _null_ _null_ _null_ ));
+DATA(insert OID = 2924 ( query_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s u 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xml _null_ _null_ _null_ ));
DESCR("map query result to XML");
-DATA(insert OID = 2925 ( cursor_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s s 5 0 142 "1790 23 16 16 25" _null_ _null_ "{cursor,count,nulls,tableforest,targetns}" _null_ _null_ cursor_to_xml _null_ _null_ _null_ ));
+DATA(insert OID = 2925 ( cursor_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s r 5 0 142 "1790 23 16 16 25" _null_ _null_ "{cursor,count,nulls,tableforest,targetns}" _null_ _null_ cursor_to_xml _null_ _null_ _null_ ));
DESCR("map rows from cursor to XML");
-DATA(insert OID = 2926 ( table_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2926 ( table_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xmlschema _null_ _null_ _null_ ));
DESCR("map table structure to XML Schema");
-DATA(insert OID = 2927 ( query_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2927 ( query_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s u 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xmlschema _null_ _null_ _null_ ));
DESCR("map query result structure to XML Schema");
-DATA(insert OID = 2928 ( cursor_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "1790 16 16 25" _null_ _null_ "{cursor,nulls,tableforest,targetns}" _null_ _null_ cursor_to_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2928 ( cursor_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "1790 16 16 25" _null_ _null_ "{cursor,nulls,tableforest,targetns}" _null_ _null_ cursor_to_xmlschema _null_ _null_ _null_ ));
DESCR("map cursor structure to XML Schema");
-DATA(insert OID = 2929 ( table_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xml_and_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2929 ( table_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "2205 16 16 25" _null_ _null_ "{tbl,nulls,tableforest,targetns}" _null_ _null_ table_to_xml_and_xmlschema _null_ _null_ _null_ ));
DESCR("map table contents and structure to XML and XML Schema");
-DATA(insert OID = 2930 ( query_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xml_and_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2930 ( query_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s u 4 0 142 "25 16 16 25" _null_ _null_ "{query,nulls,tableforest,targetns}" _null_ _null_ query_to_xml_and_xmlschema _null_ _null_ _null_ ));
DESCR("map query result and structure to XML and XML Schema");
-DATA(insert OID = 2933 ( schema_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xml _null_ _null_ _null_ ));
+DATA(insert OID = 2933 ( schema_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xml _null_ _null_ _null_ ));
DESCR("map schema contents to XML");
-DATA(insert OID = 2934 ( schema_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2934 ( schema_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xmlschema _null_ _null_ _null_ ));
DESCR("map schema structure to XML Schema");
-DATA(insert OID = 2935 ( schema_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xml_and_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2935 ( schema_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 4 0 142 "19 16 16 25" _null_ _null_ "{schema,nulls,tableforest,targetns}" _null_ _null_ schema_to_xml_and_xmlschema _null_ _null_ _null_ ));
DESCR("map schema contents and structure to XML and XML Schema");
-DATA(insert OID = 2936 ( database_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s s 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xml _null_ _null_ _null_ ));
+DATA(insert OID = 2936 ( database_to_xml PGNSP PGUID 12 100 0 0 0 f f f f t f s r 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xml _null_ _null_ _null_ ));
DESCR("map database contents to XML");
-DATA(insert OID = 2937 ( database_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2937 ( database_to_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xmlschema _null_ _null_ _null_ ));
DESCR("map database structure to XML Schema");
-DATA(insert OID = 2938 ( database_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s s 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xml_and_xmlschema _null_ _null_ _null_ ));
+DATA(insert OID = 2938 ( database_to_xml_and_xmlschema PGNSP PGUID 12 100 0 0 0 f f f f t f s r 3 0 142 "16 16 25" _null_ _null_ "{nulls,tableforest,targetns}" _null_ _null_ database_to_xml_and_xmlschema _null_ _null_ _null_ ));
DESCR("map database contents and structure to XML and XML Schema");
DATA(insert OID = 2931 ( xpath PGNSP PGUID 12 1 0 0 0 f f f f t f i s 3 0 143 "25 142 1009" _null_ _null_ _null_ _null_ _null_ xpath _null_ _null_ _null_ ));
--
2.3.8 (Apple Git-58)
Attachment: 0012-Rewrite-interaction-of-parallel-mode-with-parallel-e.patch (application/x-patch)
From ed37d06a5223e018d7b0f5b35231c5c17dd6126e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2015 18:16:26 -0400
Subject: [PATCH 12/14] Rewrite interaction of parallel mode with parallel
executor support.
In the previous coding, before returning from ExecutorRun, we'd shut
down all parallel workers. This was dead wrong if ExecutorRun was
called with a non-zero tuple count; it had the effect of truncating
the query output. To fix, give ExecutePlan control over whether to
enter parallel mode, and have it refuse to do so if the tuple count
is non-zero. Rewrite the Gather logic so that it can cope with being
called outside parallel mode.
Commit 7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b is largely to blame
for this problem, though this patch modifies some subsequently-committed
code which relied on the guarantees it purported to make.
---
src/backend/executor/execMain.c | 37 +++++++-----
src/backend/executor/execParallel.c | 17 ++++++
src/backend/executor/nodeGather.c | 108 +++++++++++++++++-------------------
src/include/executor/execParallel.h | 1 +
src/include/nodes/execnodes.h | 2 +-
5 files changed, 95 insertions(+), 70 deletions(-)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 37b7bbd..a55022e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -76,6 +76,7 @@ static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(EState *estate, PlanState *planstate,
+ bool use_parallel_mode,
CmdType operation,
bool sendTuples,
long numberTuples,
@@ -243,11 +244,6 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
AfterTriggerBeginQuery();
- /* Enter parallel mode, if required by the query. */
- if (queryDesc->plannedstmt->parallelModeNeeded &&
- !(eflags & EXEC_FLAG_EXPLAIN_ONLY))
- EnterParallelMode();
-
MemoryContextSwitchTo(oldcontext);
}
@@ -341,15 +337,13 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (!ScanDirectionIsNoMovement(direction))
ExecutePlan(estate,
queryDesc->planstate,
+ queryDesc->plannedstmt->parallelModeNeeded,
operation,
sendTuples,
count,
direction,
dest);
- /* Allow nodes to release or shut down resources. */
- (void) ExecShutdownNode(queryDesc->planstate);
-
/*
* shutdown tuple receiver, if we started it
*/
@@ -482,11 +476,6 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
*/
MemoryContextSwitchTo(oldcontext);
- /* Exit parallel mode, if it was required by the query. */
- if (queryDesc->plannedstmt->parallelModeNeeded &&
- !(estate->es_top_eflags & EXEC_FLAG_EXPLAIN_ONLY))
- ExitParallelMode();
-
/*
* Release EState and per-query memory context. This should release
* everything the executor has allocated.
@@ -1529,6 +1518,7 @@ ExecEndPlan(PlanState *planstate, EState *estate)
static void
ExecutePlan(EState *estate,
PlanState *planstate,
+ bool use_parallel_mode,
CmdType operation,
bool sendTuples,
long numberTuples,
@@ -1549,6 +1539,20 @@ ExecutePlan(EState *estate,
estate->es_direction = direction;
/*
+ * If a tuple count was supplied, we must force the plan to run without
+ * parallelism, because we might exit early.
+ */
+ if (numberTuples != 0)
+ use_parallel_mode = false;
+
+ /*
+ * If we are allowed to use parallelism for this execution, enter
+ * parallel mode now; we exit it again once the plan has been run.
+ */
+ if (use_parallel_mode)
+ EnterParallelMode();
+
+ /*
* Loop until we've processed the proper number of tuples from the plan.
*/
for (;;)
@@ -1566,7 +1570,11 @@ ExecutePlan(EState *estate,
* process so we just end the loop...
*/
if (TupIsNull(slot))
+ {
+ /* Allow nodes to release or shut down resources. */
+ (void) ExecShutdownNode(planstate);
break;
+ }
/*
* If we have a junk filter, then project a new tuple with the junk
@@ -1603,6 +1611,9 @@ ExecutePlan(EState *estate,
if (numberTuples && numberTuples == current_tuple_count)
break;
}
+
+ if (use_parallel_mode)
+ ExitParallelMode();
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e6930c1..3bb8206 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -443,6 +443,23 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
}
/*
+ * Clean up whatever ParallelExecutorInfo resources still exist after
+ * ExecParallelFinish. We separate these routines because someone might
+ * want to examine the contents of the DSM after ExecParallelFinish and
+ * before calling this routine.
+ */
+void
+ExecParallelCleanup(ParallelExecutorInfo *pei)
+{
+ if (pei->pcxt != NULL)
+ {
+ DestroyParallelContext(pei->pcxt);
+ pei->pcxt = NULL;
+ }
+ pfree(pei);
+}
+
+/*
* Create a DestReceiver to write tuples we produce to the shm_mq designated
* for that purpose.
*/
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index c689a4d..7e2272f 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/relscan.h"
+#include "access/xact.h"
#include "executor/execdebug.h"
#include "executor/execParallel.h"
#include "executor/nodeGather.h"
@@ -45,7 +46,6 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
gatherstate = makeNode(GatherState);
gatherstate->ps.plan = (Plan *) node;
gatherstate->ps.state = estate;
- gatherstate->need_to_scan_workers = false;
gatherstate->need_to_scan_locally = !node->single_copy;
/*
@@ -106,52 +106,57 @@ ExecGather(GatherState *node)
* needs to allocate large dynamic segement, so it is better to do if it
* is really needed.
*/
- if (!node->pei)
+ if (!node->initialized)
{
EState *estate = node->ps.state;
-
- /* Initialize the workers required to execute Gather node. */
- node->pei = ExecInitParallelPlan(node->ps.lefttree,
- estate,
- ((Gather *) (node->ps.plan))->num_workers);
+ Gather *gather = (Gather *) node->ps.plan;
/*
- * Register backend workers. If the required number of workers are not
- * available then we perform the scan with available workers and if
- * there are no more workers available, then the Gather node will just
- * scan locally.
+ * Sometimes we might have to run without parallelism; but if
+ * parallel mode is active then we can try to fire up some workers.
*/
- LaunchParallelWorkers(node->pei->pcxt);
-
- node->funnel = CreateTupleQueueFunnel();
-
- for (i = 0; i < node->pei->pcxt->nworkers; ++i)
+ if (gather->num_workers > 0 && IsInParallelMode())
{
- if (node->pei->pcxt->worker[i].bgwhandle)
+ bool got_any_worker = false;
+
+ /* Initialize the workers required to execute Gather node. */
+ node->pei = ExecInitParallelPlan(node->ps.lefttree,
+ estate,
+ gather->num_workers);
+
+ /*
+ * Register backend workers. We might not get as many as we
+ * requested, or indeed any at all.
+ */
+ LaunchParallelWorkers(node->pei->pcxt);
+
+ /* Set up a tuple queue to collect the results. */
+ node->funnel = CreateTupleQueueFunnel();
+ for (i = 0; i < node->pei->pcxt->nworkers; ++i)
{
- shm_mq_set_handle(node->pei->tqueue[i],
- node->pei->pcxt->worker[i].bgwhandle);
- RegisterTupleQueueOnFunnel(node->funnel, node->pei->tqueue[i]);
- node->need_to_scan_workers = true;
+ if (node->pei->pcxt->worker[i].bgwhandle)
+ {
+ shm_mq_set_handle(node->pei->tqueue[i],
+ node->pei->pcxt->worker[i].bgwhandle);
+ RegisterTupleQueueOnFunnel(node->funnel,
+ node->pei->tqueue[i]);
+ got_any_worker = true;
+ }
}
+
+ /* No workers? Then never mind. */
+ if (!got_any_worker)
+ ExecShutdownGather(node);
}
- /* If no workers are available, we must always scan locally. */
- if (!node->need_to_scan_workers)
- node->need_to_scan_locally = true;
+ /* Run plan locally if no workers or not single-copy. */
+ node->need_to_scan_locally = (node->funnel == NULL)
+ || !gather->single_copy;
+ node->initialized = true;
}
slot = gather_getnext(node);
- if (TupIsNull(slot))
- {
- /*
- * Destroy the parallel context once we complete fetching all the
- * tuples. Otherwise, the DSM and workers will stick around for the
- * lifetime of the entire statement.
- */
- ExecShutdownGather(node);
- }
return slot;
}
@@ -194,10 +199,9 @@ gather_getnext(GatherState *gatherstate)
*/
slot = gatherstate->ps.ps_ProjInfo->pi_slot;
- while (gatherstate->need_to_scan_workers ||
- gatherstate->need_to_scan_locally)
+ while (gatherstate->funnel != NULL || gatherstate->need_to_scan_locally)
{
- if (gatherstate->need_to_scan_workers)
+ if (gatherstate->funnel != NULL)
{
bool done = false;
@@ -206,7 +210,7 @@ gather_getnext(GatherState *gatherstate)
gatherstate->need_to_scan_locally,
&done);
if (done)
- gatherstate->need_to_scan_workers = false;
+ ExecShutdownGather(gatherstate);
if (HeapTupleIsValid(tup))
{
@@ -247,30 +251,20 @@ gather_getnext(GatherState *gatherstate)
void
ExecShutdownGather(GatherState *node)
{
- Gather *gather;
-
- if (node->pei == NULL || node->pei->pcxt == NULL)
- return;
-
- /*
- * Ensure all workers have finished before destroying the parallel context
- * to ensure a clean exit.
- */
- if (node->funnel)
+ /* Shut down tuple queue funnel before shutting down workers. */
+ if (node->funnel != NULL)
{
DestroyTupleQueueFunnel(node->funnel);
node->funnel = NULL;
}
- ExecParallelFinish(node->pei);
-
- /* destroy parallel context. */
- DestroyParallelContext(node->pei->pcxt);
- node->pei->pcxt = NULL;
-
- gather = (Gather *) node->ps.plan;
- node->need_to_scan_locally = !gather->single_copy;
- node->need_to_scan_workers = false;
+ /* Now shut down the workers. */
+ if (node->pei != NULL)
+ {
+ ExecParallelFinish(node->pei);
+ ExecParallelCleanup(node->pei);
+ node->pei = NULL;
+ }
}
/* ----------------------------------------------------------------
@@ -295,5 +289,7 @@ ExecReScanGather(GatherState *node)
*/
ExecShutdownGather(node);
+ node->initialized = false;
+
ExecReScan(node->ps.lefttree);
}
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 4fc797a..505500e 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -32,5 +32,6 @@ typedef struct ParallelExecutorInfo
extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
EState *estate, int nworkers);
extern void ExecParallelFinish(ParallelExecutorInfo *pei);
+extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
#endif /* EXECPARALLEL_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6895f9..d705445 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1960,9 +1960,9 @@ typedef struct UniqueState
typedef struct GatherState
{
PlanState ps; /* its first field is NodeTag */
+ bool initialized;
struct ParallelExecutorInfo *pei;
struct TupleQueueFunnel *funnel;
- bool need_to_scan_workers;
bool need_to_scan_locally;
} GatherState;
--
2.3.8 (Apple Git-58)
Attachment: 0013-Modify-tqueue-infrastructure-to-support-transient-re.patch (application/x-patch)
From 2ee78b44a088b8f9e7b4fa0f1d05a7c89e9f169e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2015 12:43:22 -0400
Subject: [PATCH 13/14] Modify tqueue infrastructure to support transient
record types.
Commit 4a4e6893aa080b9094dadbe0e65f8a75fee41ac6, which introduced this
mechanism, failed to account for the fact that the RECORD pseudo-type
uses transient typmods that are only meaningful within a single
backend. Transferring such tuples without modification between two
cooperating backends does not work. This commit installs a system
for passing the tuple descriptors over the same shm_mq being used to
send the tuples themselves. The two sides might not assign the same
transient typmod to any given tuple descriptor, so we must also
substitute the appropriate receiver-side typmod for the one used by
the sender. That adds some CPU overhead, but still seems better than
being unable to pass records between cooperating parallel processes.
---
src/backend/executor/nodeGather.c | 1 +
src/backend/executor/tqueue.c | 492 +++++++++++++++++++++++++++++++++++---
src/include/executor/tqueue.h | 4 +-
3 files changed, 467 insertions(+), 30 deletions(-)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 7e2272f..bf62eee 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -207,6 +207,7 @@ gather_getnext(GatherState *gatherstate)
/* wait only if local scan is done */
tup = TupleQueueFunnelNext(gatherstate->funnel,
+ slot->tts_tupleDescriptor,
gatherstate->need_to_scan_locally,
&done);
if (done)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 67143d3..53b69e0 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -21,23 +21,55 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "catalog/pg_type.h"
#include "executor/tqueue.h"
+#include "funcapi.h"
+#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/memutils.h"
+#include "utils/typcache.h"
typedef struct
{
DestReceiver pub;
shm_mq_handle *handle;
+ MemoryContext tmpcontext;
+ HTAB *recordhtab;
+ char mode;
} TQueueDestReceiver;
+typedef struct RecordTypemodMap
+{
+ int remotetypmod;
+ int localtypmod;
+} RecordTypemodMap;
+
struct TupleQueueFunnel
{
int nqueues;
int maxqueues;
int nextqueue;
shm_mq_handle **queue;
+ char *mode;
+ HTAB *typmodmap;
};
+#define TUPLE_QUEUE_MODE_CONTROL 'c'
+#define TUPLE_QUEUE_MODE_DATA 'd'
+
+static void tqueueWalkRecord(TQueueDestReceiver *tqueue, Datum value);
+static void tqueueWalkRecordArray(TQueueDestReceiver *tqueue, Datum value);
+static void TupleQueueHandleControlMessage(TupleQueueFunnel *funnel,
+ Size nbytes, char *data);
+static HeapTuple TupleQueueHandleDataMessage(TupleQueueFunnel *funnel,
+ TupleDesc tupledesc, Size nbytes,
+ HeapTupleHeader data);
+static HeapTuple TupleQueueRemapTuple(TupleQueueFunnel *funnel,
+ TupleDesc tupledesc, HeapTuple tuple);
+static Datum TupleQueueRemapRecord(TupleQueueFunnel *funnel, Datum value);
+static Datum TupleQueueRemapRecordArray(TupleQueueFunnel *funnel, Datum value);
+
/*
* Receive a tuple.
*/
@@ -46,12 +78,178 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
HeapTuple tuple;
+ HeapTupleHeader tup;
+ AttrNumber i;
tuple = ExecMaterializeSlot(slot);
+ tup = tuple->t_data;
+
+ /*
+ * If any of the columns that we're sending back are records, special
+ * handling is required, because the tuple descriptors are stored in a
+ * backend-local cache, and the backend receiving data from us need not
+ * have the same cache contents we do. We grovel through the tuple,
+ * find all the transient record types contained therein, and send
+ * special control messages through the queue so that the receiving
+ * process can interpret them correctly.
+ */
+ for (i = 0; i < slot->tts_tupleDescriptor->natts; ++i)
+ {
+ Form_pg_attribute attr = slot->tts_tupleDescriptor->attrs[i];
+ MemoryContext oldcontext;
+
+ /* Ignore nulls and non-records. */
+ if (slot->tts_isnull[i] || (attr->atttypid != RECORDOID
+ && attr->atttypid != RECORDARRAYOID))
+ continue;
+
+ /*
+ * OK, we're going to need to examine this attribute. We could
+ * use heap_deform_tuple here, but there's a possibility that the
+ * slot already contains the deconstructed tuple, in which case
+ * deforming it again would be needlessly inefficient.
+ */
+ slot_getallattrs(slot);
+
+ /* Switch to temporary memory context to avoid leaking. */
+ if (tqueue->tmpcontext == NULL)
+ tqueue->tmpcontext =
+ AllocSetContextCreate(TopTransactionContext,
+ "tqueue temporary context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ oldcontext = MemoryContextSwitchTo(tqueue->tmpcontext);
+ if (attr->atttypid == RECORDOID)
+ tqueueWalkRecord(tqueue, slot->tts_values[i]);
+ else
+ tqueueWalkRecordArray(tqueue, slot->tts_values[i]);
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Clean up any memory we allocated. */
+ MemoryContextReset(tqueue->tmpcontext);
+ }
+
+ /* If we entered control mode, switch back to data mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_DATA)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_DATA;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+
+ /* Send the tuple itself. */
shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
}
/*
+ * Walk a record and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkRecord(TQueueDestReceiver *tqueue, Datum value)
+{
+ HeapTupleHeader tup;
+ Oid typmod;
+ bool found;
+ TupleDesc tupledesc;
+ Datum *values;
+ bool *isnull;
+ HeapTupleData tdata;
+ AttrNumber i;
+
+ /* Extract typmod from tuple. */
+ tup = DatumGetHeapTupleHeader(value);
+ typmod = HeapTupleHeaderGetTypMod(tup);
+
+ /* Look up tuple descriptor in typecache. */
+ tupledesc = lookup_rowtype_tupdesc(RECORDOID, typmod);
+
+ /* Initialize hash table if not done yet. */
+ if (tqueue->recordhtab == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(int);
+ ctl.hcxt = TopMemoryContext;
+ tqueue->recordhtab = hash_create("tqueue record hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
+ }
+
+ /* Have we already seen this record type? If not, must report it. */
+ hash_search(tqueue->recordhtab, &typmod, HASH_ENTER, &found);
+ if (!found)
+ {
+ StringInfoData buf;
+
+ /* If message queue is in data mode, switch to control mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_CONTROL)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_CONTROL;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+
+ /* Assemble a control message. */
+ initStringInfo(&buf);
+ appendBinaryStringInfo(&buf, (char *) &typmod, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->natts, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->tdhasoid,
+ sizeof(bool));
+ for (i = 0; i < tupledesc->natts; ++i)
+ appendBinaryStringInfo(&buf, (char *) tupledesc->attrs[i],
+ sizeof(FormData_pg_attribute));
+
+ /* Send control message. */
+ shm_mq_send(tqueue->handle, buf.len, buf.data, false);
+ }
+
+ /* Deform the tuple so we can check each column within. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ tdata.t_len = HeapTupleHeaderGetDatumLength(tup);
+ ItemPointerSetInvalid(&(tdata.t_self));
+ tdata.t_tableOid = InvalidOid;
+ tdata.t_data = tup;
+ heap_deform_tuple(&tdata, tupledesc, values, isnull);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
+ {
+ Form_pg_attribute attr = tupledesc->attrs[i];
+ if (isnull[i])
+ continue;
+ if (attr->atttypid == RECORDOID)
+ tqueueWalkRecord(tqueue, values[i]);
+ if (attr->atttypid == RECORDARRAYOID)
+ tqueueWalkRecordArray(tqueue, values[i]);
+ }
+
+ /* Release reference count acquired by lookup_rowtype_tupdesc. */
+ DecrTupleDescRefCount(tupledesc);
+}
+
+/*
+ * Walk an array of records and send control messages for any transient
+ * record types contained therein.
+ */
+static void
+tqueueWalkRecordArray(TQueueDestReceiver *tqueue, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
+
+ Assert(ARR_ELEMTYPE(arr) == RECORDOID);
+ deconstruct_array(arr, RECORDOID, -1, false, 'd',
+ &elem_values, &elem_nulls, &num_elems);
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ tqueueWalkRecord(tqueue, elem_values[i]);
+}
+
+/*
* Prepare to receive tuples from executor.
*/
static void
@@ -77,6 +275,12 @@ tqueueShutdownReceiver(DestReceiver *self)
static void
tqueueDestroyReceiver(DestReceiver *self)
{
+ TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+
+ if (tqueue->tmpcontext != NULL)
+ MemoryContextDelete(tqueue->tmpcontext);
+ if (tqueue->recordhtab != NULL)
+ hash_destroy(tqueue->recordhtab);
pfree(self);
}
@@ -96,6 +300,9 @@ CreateTupleQueueDestReceiver(shm_mq_handle *handle)
self->pub.rDestroy = tqueueDestroyReceiver;
self->pub.mydest = DestTupleQueue;
self->handle = handle;
+ self->tmpcontext = NULL;
+ self->recordhtab = NULL;
+ self->mode = TUPLE_QUEUE_MODE_DATA;
return (DestReceiver *) self;
}
@@ -110,6 +317,7 @@ CreateTupleQueueFunnel(void)
funnel->maxqueues = 8;
funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+ funnel->mode = palloc(funnel->maxqueues * sizeof(char));
return funnel;
}
@@ -125,6 +333,9 @@ DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
for (i = 0; i < funnel->nqueues; i++)
shm_mq_detach(shm_mq_get_queue(funnel->queue[i]));
pfree(funnel->queue);
+ pfree(funnel->mode);
+ if (funnel->typmodmap != NULL)
+ hash_destroy(funnel->typmodmap);
pfree(funnel);
}
@@ -134,12 +345,6 @@ DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
void
RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
{
- if (funnel->nqueues < funnel->maxqueues)
- {
- funnel->queue[funnel->nqueues++] = handle;
- return;
- }
-
if (funnel->nqueues >= funnel->maxqueues)
{
int newsize = funnel->nqueues * 2;
@@ -148,10 +353,12 @@ RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
funnel->queue = repalloc(funnel->queue,
newsize * sizeof(shm_mq_handle *));
+ funnel->mode = repalloc(funnel->mode, newsize * sizeof(char));
funnel->maxqueues = newsize;
}
- funnel->queue[funnel->nqueues++] = handle;
+ funnel->queue[funnel->nqueues] = handle;
+ funnel->mode[funnel->nqueues++] = TUPLE_QUEUE_MODE_DATA;
}
/*
@@ -172,7 +379,8 @@ RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
* any other case.
*/
HeapTuple
-TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, TupleDesc tupledesc,
+ bool nowait, bool *done)
{
int waitpos = funnel->nextqueue;
@@ -190,6 +398,7 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
for (;;)
{
shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+ char *modep = &funnel->mode[funnel->nextqueue];
shm_mq_result result;
Size nbytes;
void *data;
@@ -198,15 +407,10 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
result = shm_mq_receive(mqh, &nbytes, &data, true);
/*
- * Normally, we advance funnel->nextqueue to the next queue at this
- * point, but if we're pointing to a queue that we've just discovered
- * is detached, then forget that queue and leave the pointer where it
- * is until the number of remaining queues fall below that pointer and
- * at that point make the pointer point to the first queue.
+ * If this queue has been detached, forget about it and shift the
+ * remaining queues downward in the array.
*/
- if (result != SHM_MQ_DETACHED)
- funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
- else
+ if (result == SHM_MQ_DETACHED)
{
--funnel->nqueues;
if (funnel->nqueues == 0)
@@ -230,21 +434,32 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
continue;
}
+ /* Advance nextqueue pointer to next queue in round-robin fashion. */
+ funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+
/* If we got a message, return it. */
if (result == SHM_MQ_SUCCESS)
{
- HeapTupleData htup;
-
- /*
- * The tuple data we just read from the queue is only valid until
- * we again attempt to read from it. Copy the tuple into a single
- * palloc'd chunk as callers will expect.
- */
- ItemPointerSetInvalid(&htup.t_self);
- htup.t_tableOid = InvalidOid;
- htup.t_len = nbytes;
- htup.t_data = data;
- return heap_copytuple(&htup);
+ if (nbytes == 1)
+ {
+ /* Mode switch message. */
+ *modep = ((char *) data)[0];
+ continue;
+ }
+ else if (*modep == TUPLE_QUEUE_MODE_DATA)
+ {
+ /* Tuple data. */
+ return TupleQueueHandleDataMessage(funnel, tupledesc,
+ nbytes, data);
+ }
+ else if (*modep == TUPLE_QUEUE_MODE_CONTROL)
+ {
+ /* Control message, describing a transient record type. */
+ TupleQueueHandleControlMessage(funnel, nbytes, data);
+ continue;
+ }
+ else
+ elog(ERROR, "invalid mode: %d", (int) *modep);
}
/*
@@ -262,3 +477,224 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
}
}
}
+
+/*
+ * Handle a data message - that is, a tuple - from the tuple queue funnel.
+ */
+static HeapTuple
+TupleQueueHandleDataMessage(TupleQueueFunnel *funnel, TupleDesc tupledesc,
+ Size nbytes, HeapTupleHeader data)
+{
+ HeapTupleData htup;
+
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = nbytes;
+ htup.t_data = data;
+
+ /* If necessary, remap record typmods. */
+ if (funnel->typmodmap != NULL)
+ {
+ HeapTuple newtuple;
+
+ newtuple = TupleQueueRemapTuple(funnel, tupledesc, &htup);
+ if (newtuple != NULL)
+ return newtuple;
+ }
+
+ /*
+ * Otherwise, just copy the tuple into a single palloc'd chunk, as
+ * callers will expect.
+ */
+ return heap_copytuple(&htup);
+}
+
+/*
+ * Remap tuple typmods per control information received from remote side.
+ */
+static HeapTuple
+TupleQueueRemapTuple(TupleQueueFunnel *funnel, TupleDesc tupledesc,
+ HeapTuple tuple)
+{
+ Datum *values;
+ bool *isnull;
+ bool dirty = false;
+ int i;
+
+ /* Deform tuple so we can remap record typmods for individual attrs. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ heap_deform_tuple(tuple, tupledesc, values, isnull);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
+ {
+ Form_pg_attribute attr = tupledesc->attrs[i];
+
+ if (isnull[i])
+ continue;
+
+ if (attr->atttypid == RECORDOID)
+ {
+ values[i] = TupleQueueRemapRecord(funnel, values[i]);
+ dirty = true;
+ }
+
+
+ if (attr->atttypid == RECORDARRAYOID)
+ {
+ values[i] = TupleQueueRemapRecordArray(funnel, values[i]);
+ dirty = true;
+ }
+ }
+
+ /* If we didn't need to change anything, just return NULL. */
+ if (!dirty)
+ return NULL;
+
+ /* Reform the modified tuple. */
+ return heap_form_tuple(tupledesc, values, isnull);
+}
+
+static Datum
+TupleQueueRemapRecord(TupleQueueFunnel *funnel, Datum value)
+{
+ HeapTupleHeader tup;
+ int remotetypmod;
+ RecordTypemodMap *mapent;
+ TupleDesc atupledesc;
+ HeapTupleData htup;
+ HeapTuple atup;
+
+ tup = DatumGetHeapTupleHeader(value);
+
+ /* Map remote typmod to local typmod and get tupledesc. */
+ remotetypmod = HeapTupleHeaderGetTypMod(tup);
+ Assert(funnel->typmodmap != NULL);
+ mapent = hash_search(funnel->typmodmap, &remotetypmod,
+ HASH_FIND, NULL);
+ if (mapent == NULL)
+ elog(ERROR, "found unrecognized remote typmod %d",
+ remotetypmod);
+ atupledesc = lookup_rowtype_tupdesc(RECORDOID, mapent->localtypmod);
+
+ /* Recursively process contents of record. */
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = HeapTupleHeaderGetDatumLength(tup);
+ htup.t_data = tup;
+ atup = TupleQueueRemapTuple(funnel, atupledesc, &htup);
+
+ /* Release reference count acquired by lookup_rowtype_tupdesc. */
+ DecrTupleDescRefCount(atupledesc);
+
+ /*
+ * Even if none of the attributes inside this tuple are records that
+ * require typmod remapping, we still need to change the typmod on
+ * the record itself. However, we can do that by copying the tuple
+ * rather than reforming it.
+ */
+ if (atup == NULL)
+ {
+ atup = heap_copytuple(&htup);
+ HeapTupleHeaderSetTypMod(atup->t_data, mapent->localtypmod);
+ }
+
+ return HeapTupleHeaderGetDatum(atup->t_data);
+}
+
+static Datum
+TupleQueueRemapRecordArray(TupleQueueFunnel *funnel, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
+
+ Assert(ARR_ELEMTYPE(arr) == RECORDOID);
+ deconstruct_array(arr, RECORDOID, -1, false, 'd',
+ &elem_values, &elem_nulls, &num_elems);
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ elem_values[i] = TupleQueueRemapRecord(funnel, elem_values[i]);
+ arr = construct_md_array(elem_values, elem_nulls,
+ ARR_NDIM(arr), ARR_DIMS(arr), ARR_LBOUND(arr),
+ RECORDOID,
+ -1, false, 'd');
+ return PointerGetDatum(arr);
+}
+
+/*
+ * Handle a control message from the tuple queue funnel.
+ *
+ * Control messages are sent when the remote side is sending tuples that
+ * contain transient record types. We need to arrange to bless those
+ * record types locally and translate between remote and local typmods.
+ */
+static void
+TupleQueueHandleControlMessage(TupleQueueFunnel *funnel, Size nbytes,
+ char *data)
+{
+ int natts;
+ int remotetypmod;
+ bool hasoid;
+ char *buf = data;
+ int rc = 0;
+ int i;
+ Form_pg_attribute *attrs;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ RecordTypemodMap *mapent;
+ bool found;
+
+ /* Extract remote typmod. */
+ memcpy(&remotetypmod, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract attribute count. */
+ memcpy(&natts, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract hasoid flag. */
+ memcpy(&hasoid, &buf[rc], sizeof(bool));
+ rc += sizeof(bool);
+
+ /* Extract attribute details. */
+ oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+ attrs = palloc(natts * sizeof(Form_pg_attribute));
+ for (i = 0; i < natts; ++i)
+ {
+ attrs[i] = palloc(sizeof(FormData_pg_attribute));
+ memcpy(attrs[i], &buf[rc], sizeof(FormData_pg_attribute));
+ rc += sizeof(FormData_pg_attribute);
+ }
+ MemoryContextSwitchTo(oldcontext);
+
+ /* We should have read the whole message. */
+ Assert(rc == nbytes);
+
+ /* Construct TupleDesc. */
+ tupledesc = CreateTupleDesc(natts, hasoid, attrs);
+ tupledesc = BlessTupleDesc(tupledesc);
+
+ /* Create map if it doesn't exist already. */
+ if (funnel->typmodmap == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(RecordTypemodMap);
+ ctl.hcxt = CurTransactionContext;
+ funnel->typmodmap = hash_create("typmodmap hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
+ }
+
+ /* Create map entry. */
+ mapent = hash_search(funnel->typmodmap, &remotetypmod, HASH_ENTER,
+ &found);
+ if (found)
+ elog(ERROR, "duplicate message for typmod %d",
+ remotetypmod);
+ mapent->localtypmod = tupledesc->tdtypmod;
+}
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
index 6f8eb73..59f35c7 100644
--- a/src/include/executor/tqueue.h
+++ b/src/include/executor/tqueue.h
@@ -25,7 +25,7 @@ typedef struct TupleQueueFunnel TupleQueueFunnel;
extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
-extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
- bool *done);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, TupleDesc tupledesc,
+ bool nowait, bool *done);
#endif /* TQUEUE_H */
--
2.3.8 (Apple Git-58)
Attachment: 0014-Fix-problems-with-ParamListInfo-serialization-mechan.patch (application/x-patch)
From ad2fbc6fbf143db4f8b2231f03100b60029a1275 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 12 Oct 2015 11:46:40 -0400
Subject: [PATCH 14/14] Fix problems with ParamListInfo serialization
mechanism.
Commit d1b7c1ffe72e86932b5395f29e006c3f503bc53d introduced a mechanism
for serializing a ParamListInfo structure to be passed to a parallel
worker. However, this mechanism failed to handle external expanded
values, as pointed out by Noah Misch. Moreover, plpgsql_param_fetch
requires adjustment because the serialization mechanism needs it to skip
evaluating unused parameters just as we would do when it is called from
copyParamList, but params == estate->paramLI in that case. To fix, do
the relevant bms_is_member test unconditionally.
---
src/backend/utils/adt/datum.c | 16 ++++++++++++++++
src/pl/plpgsql/src/pl_exec.c | 26 +++++++++++---------------
2 files changed, 27 insertions(+), 15 deletions(-)
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 3d9e354..0d61950 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -264,6 +264,11 @@ datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
/* no need to use add_size, can't overflow */
if (typByVal)
sz += sizeof(Datum);
+ else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+ {
+ ExpandedObjectHeader *eoh = DatumGetEOHP(value);
+ sz += EOH_get_flat_size(eoh);
+ }
else
sz += datumGetSize(value, typByVal, typLen);
}
@@ -292,6 +297,7 @@ void
datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
char **start_address)
{
+ ExpandedObjectHeader *eoh = NULL;
int header;
/* Write header word. */
@@ -299,6 +305,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
header = -2;
else if (typByVal)
header = -1;
+ else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+ {
+ eoh = DatumGetEOHP(value);
+ header = EOH_get_flat_size(eoh);
+ }
else
header = datumGetSize(value, typByVal, typLen);
memcpy(*start_address, &header, sizeof(int));
@@ -312,6 +323,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
memcpy(*start_address, &value, sizeof(Datum));
*start_address += sizeof(Datum);
}
+ else if (eoh)
+ {
+ EOH_flatten_into(eoh, (void *) *start_address, header);
+ *start_address += header;
+ }
else
{
memcpy(*start_address, DatumGetPointer(value), header);
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index c73f20b..346e8f8 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -5696,21 +5696,17 @@ plpgsql_param_fetch(ParamListInfo params, int paramid)
/* now we can access the target datum */
datum = estate->datums[dno];
- /* need to behave slightly differently for shared and unshared arrays */
- if (params != estate->paramLI)
- {
- /*
- * We're being called, presumably from copyParamList(), for cursor
- * parameters. Since copyParamList() will try to materialize every
- * single parameter slot, it's important to do nothing when asked for
- * a datum that's not supposed to be used by this SQL expression.
- * Otherwise we risk failures in exec_eval_datum(), not to mention
- * possibly copying a lot more data than the cursor actually uses.
- */
- if (!bms_is_member(dno, expr->paramnos))
- return;
- }
- else
+ /*
+ * Since copyParamList() and SerializeParamList() will try to materialize
+ * every single parameter slot, it's important to do nothing when asked for
+ * a datum that's not supposed to be used by this SQL expression.
+ * Otherwise we risk failures in exec_eval_datum(), not to mention
+ * possibly copying a lot more data than the cursor actually uses.
+ */
+ if (!bms_is_member(dno, expr->paramnos))
+ return;
+
+ if (params == estate->paramLI)
{
/*
* Normal evaluation cases. We don't need to sanity-check dno, but we
--
2.3.8 (Apple Git-58)
On Mon, Oct 12, 2015 at 1:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Attached are 14 patches. Patches #1-#4 are
> essential for testing purposes but are not proposed for commit,
> although some of the code they contain may eventually become part of
> other patches which are proposed for commit. Patches #5-#12 are
> largely boring patches fixing fairly uninteresting mistakes; I propose
> to commit these on an expedited basis. Patches #13-14 are also
> proposed for commit but seem to me to be more in need of review.
Hearing no objections, I've now gone and committed #5-#12.
0013-Modify-tqueue-infrastructure-to-support-transient-re.patch
attempts to address a deficiency in the tqueue.c/tqueue.h machinery I
recently introduced: backends can have ephemeral record types for
which they use backend-local typmods that may not be the same between
the leader and the worker. This patch has the worker send metadata
about the tuple descriptor for each such type, and the leader
registers the same tuple descriptor and then remaps the typmods from
the worker's typmod space to its own. This seems to work, but I'm a
little concerned that there may be cases it doesn't cover. Also,
there's room to question the overall approach. The only other
alternative that springs readily to mind is to try to arrange things
during the planning phase so that we never try to pass records between
parallel backends in this way, but that seems like it would be hard to
code (and thus likely to have bugs) and also pretty limiting.
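To make that protocol easier to review without wading through all of
tqueue.c at once, here is a minimal standalone sketch of the remapping
idea in plain C. Everything below is illustrative only -
handle_control_message(), remap_typmod(), and the string "descriptors"
stand in for the real control messages, BlessTupleDesc(), and the
typmodmap hash table - but it compiles and runs on its own:

#include <stdio.h>
#include <string.h>

#define MAX_TYPES 16

/* Leader-side map: worker ("remote") typmod -> leader ("local") typmod. */
struct typmod_map { int remote; int local; };

static struct typmod_map map[MAX_TYPES];
static int nmap;
static const char *leader_types[MAX_TYPES];   /* locally registered descriptors */
static int nleader;

/*
 * Control message: the worker describes a transient record type to which
 * it has assigned a worker-local typmod.  The leader registers an
 * equivalent descriptor locally and remembers the remote->local mapping.
 */
static void
handle_control_message(int remote_typmod, const char *descriptor)
{
    int local;

    for (local = 0; local < nleader; local++)
        if (strcmp(leader_types[local], descriptor) == 0)
            break;
    if (local == nleader)
        leader_types[nleader++] = descriptor;

    map[nmap].remote = remote_typmod;
    map[nmap].local = local;
    nmap++;
}

/*
 * Data message: a tuple arrives stamped with the worker's typmod, and the
 * leader substitutes its own typmod before passing the tuple on.
 */
static int
remap_typmod(int remote_typmod)
{
    int i;

    for (i = 0; i < nmap; i++)
        if (map[i].remote == remote_typmod)
            return map[i].local;
    return -1;          /* corresponds to the "unrecognized typmod" error */
}

int
main(void)
{
    handle_control_message(7, "(int4,text)");   /* worker chose typmod 7 */
    printf("worker typmod 7 -> leader typmod %d\n", remap_typmod(7));
    return 0;
}

The real code does the same thing per attribute and recurses into nested
records and record arrays.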
I am still hoping someone will step up to review this.
0014-Fix-problems-with-ParamListInfo-serialization-mechan.patch, which
I just posted on the Parallel Seq Scan thread as a standalone patch,
fixes pretty much what the name of the file suggests. This actually
fixes two problems, one of which Noah spotted and commented on over on
that thread. By pure coincidence, the last 'make check' regression
failure I was still troubleshooting needed a fix for that issue plus a
fix to plpgsql_param_fetch. However, as I mentioned on the other
thread, I'm not quite sure which way to go with the change to
plpgsql_param_fetch so scrutiny of that point, in particular, would be
appreciated. See also
/messages/by-id/CA+TgmobN=wADVaUTwsH-xqvCdovkeRasuXw2k3R6vmpWig7raw@mail.gmail.com
Noah's been helping with this issue on the other thread. I'll revise
this patch along the lines discussed there and resubmit.
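For the expanded-value part, the essential change is that the serializer
now sizes and writes the flattened form of an expanded object rather than
treating its in-memory representation as a flat chunk. Here is a toy
standalone model of that header convention, with made-up names (HDR_NULL,
fake_datum, and serialize() are not the real API; the actual code is
datumEstimateSpace()/datumSerialize() with EOH_get_flat_size() and
EOH_flatten_into()):

#include <stdio.h>
#include <string.h>

#define HDR_NULL   (-2)         /* header word: datum is NULL */
#define HDR_BYVAL  (-1)         /* header word: pass-by-value datum follows */

struct fake_datum
{
    int         is_null;
    int         by_val;
    long        byval_value;    /* used only when by_val */
    const char *flat_form;      /* what flattening the value would produce */
};

/* Write a header word, then either the value itself or its flat form. */
static size_t
serialize(const struct fake_datum *d, char *out)
{
    char   *p = out;
    int     header;

    if (d->is_null)
        header = HDR_NULL;
    else if (d->by_val)
        header = HDR_BYVAL;
    else
        header = (int) strlen(d->flat_form);    /* flat size, not pointer size */

    memcpy(p, &header, sizeof(header));
    p += sizeof(header);

    if (header == HDR_BYVAL)
    {
        memcpy(p, &d->byval_value, sizeof(d->byval_value));
        p += sizeof(d->byval_value);
    }
    else if (header >= 0)
    {
        /* For an expanded object, the real code calls EOH_flatten_into()
         * here rather than copying the in-memory representation. */
        memcpy(p, d->flat_form, (size_t) header);
        p += header;
    }
    return (size_t) (p - out);
}

int
main(void)
{
    char    buf[64];
    struct fake_datum d = { 0, 0, 0, "flattened-expanded-array" };

    printf("wrote %zu bytes\n", serialize(&d, buf));
    return 0;
}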
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12 October 2015 at 18:04, Robert Haas <robertmhaas@gmail.com> wrote:
> My recent commit of the Gather executor node has made it relatively
> simple to write code that does an end-to-end test of all of the
> parallelism-relate commits which have thus far gone into the tree.
I've been wanting to help here for a while, but time remains limited for
the next month or so.
From reading this my understanding is that there isn't a test suite
included with this commit?
I've tried to review the Gather node commit and I note that the commit
message contains a longer description of the functionality in that patch
than any comments in the patch as a whole. No design comments, no README,
no file header comments. For such a major feature that isn't acceptable - I
would reject a patch from others on that basis alone (and have done so). We
must keep the level of comments high if we are to encourage wider
participation in the project.
So reviewing patch 13 isn't possible without prior knowledge.
Hoping we'll be able to find some time on this at PGConf.eu; thanks for
coming over.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 17, 2015 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> From reading this my understanding is that there isn't a test suite included
> with this commit?
Right. The patches on the thread contain code that can be used for
testing, but the committed code does not itself include test coverage.
I welcome thoughts on how we could perform automated testing of this
code. I think at least part of the answer is that I need to press on
toward getting the rest of Amit's parallel sequential scan patch
committed, because then this becomes a user-visible feature and I
expect that to make it much easier to find whatever bugs remain. A
big part of the difficulty in testing this up until now is that I've
been building towards, hey, we have parallel query. Until we actually
do, you need to write C code to test this, which raises the bar
considerably.
Now, that does not mean we shouldn't test this in other ways, and of
course I have, and so have Amit and other people from the community -
of late, Noah Misch and Haribabu Kommi have found several bugs through
code inspection and testing, which included some of the same ones that
I was busy finding and fixing using the test code attached to this
thread. That's one of the reasons why I wanted to press forward with
getting the fixes for those bugs committed. It's just a waste of
everybody's time if we keep re-finding known bugs for which fixes already
exist.
But the question of how to test this in the buildfarm is a good one,
and I don't have a complete answer. Once the rest of this goes in,
which I hope will be soon, we can EXPLAIN or EXPLAIN ANALYZE or just
straight up run parallel queries in the regression test suite and see
that they behave as expected. But I don't expect that to provide
terribly good test coverage. One idea that I think would provide
*excellent* test coverage is to take the test code included on this
thread and run it on the buildfarm. The idea of the code is to
basically run the regression test suite with every parallel-eligible
query forced to unnecessarily use parallelism. Turning that on and
running 'make check' found, directly or indirectly, all of these bugs.
Doing that on the whole buildfarm would probably find more.
However, I'm pretty sure that we don't want to switch the *entire*
buildfarm to using lots of unnecessary parallelism. What we might be
able to do is have some critters that people spin up for this precise
purpose. Just like we currently have CLOBBER_CACHE_ALWAYS buildfarm
members, we could have GRATUITOUSLY_PARALLEL buildfarm members. If
Andrew is willing to add buildfarm support for that option and a few
people are willing to run critters in that mode, I will be happy -
more than happy, really - to put the test code into committable form,
guarded by a #define, and away we go.
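By "guarded by a #define" I mean roughly the following shape in the
planner - a rough sketch with made-up helper names, not the committable
form:

#include "postgres.h"
#include "nodes/plannodes.h"

/* Hypothetical stand-ins for the test code posted earlier on this thread. */
extern bool test_plan_is_parallel_safe(Plan *plan);
extern Plan *test_push_single_copy_gather(Plan *plan);

static Plan *
maybe_add_test_gather(Plan *top_plan)
{
#ifdef GRATUITOUSLY_PARALLEL
    /*
     * Testing only: force a single-copy Gather on top of any plan believed
     * parallel-safe, so that every eligible regression-test query exercises
     * the parallel executor machinery.
     */
    if (test_plan_is_parallel_safe(top_plan))
        top_plan = test_push_single_copy_gather(top_plan);
#endif
    return top_plan;
}

The exact shape will depend on how the final version of the test code gets
integrated, but that's the general idea.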
Of course, other ideas for testing are also welcome.
I've tried to review the Gather node commit and I note that the commit
message contains a longer description of the functionality in that patch
than any comments in the patch as a whole. No design comments, no README, no
file header comments. For such a major feature that isn't acceptable - I
would reject a patch from others on that basis alone (and have done so). We
must keep the level of comments high if we are to encourage wider
participation in the project.
It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation. Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular. Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.
It's worth noting, though, that the executor files in general don't
contain great gobs of comments, and the executor README even has this
vintage 2001 comment:
XXX a great deal more documentation needs to be written here...
Well, yeah. It's taken me a long time to understand how the executor
actually works, and there are parts of it - particularly related to
EvalPlanQual - that I still don't fully understand. So some of the
lack of comments in, for example, nodeGather.c is because it copies
the style of other executor nodes, like nodeSeqscan.c. It's not
exactly clear to me what more to document there. You either
understand what a rescan node is, in which case the code for each
node's rescan method tends to be fairly self-evident, or you don't -
but that clearly shouldn't be re-explained in every file. So I guess
what I'm saying is I could use some advice on what kinds of things would
be most useful to document, and where to put that documentation.
Right now, the best explanation of how parallelism works is in
src/backend/access/transam/README.parallel -- but, as you rightly
point out, that doesn't cover the executor bits. Should we have SGML
documentation under "VII. Internals" that explains what's under the
hood in the same way that we have sections for "Database Physical
Storage" and "PostgreSQL Coding Conventions"? Should the stuff in the
existing README.parallel be moved there? Or I could just add some
words to src/backend/executor/README to cover the parallel executor
stuff, if that is preferred. Advice?
Also, regardless of how we document what's going on at the code level,
I think we probably should have a section *somewhere* in the main SGML
documentation that kind of explains the general concepts behind
parallel query from a user/DBA perspective. But I don't know where to
put it. Under "Server Administration"? Exactly what to explain there
needs some thought, too. I'm sort of wondering if we need two
chapters in the documentation on this, one that covers it from a
user/DBA perspective and the other of which covers it from a hacker
perspective. But then maybe the hacker stuff should just go in README
files. I'm not sure. I may have to try writing some of this and see
how it goes, but advice is definitely appreciated.
I am happy to definitively commit to writing whatever documentation
the community feels is necessary here, and I will do that certainly
before the end of development for 9.6 and hopefully much sooner than that.
I will do that even if I don't get any specific feedback on what to
write and where to put it, but the more feedback I get, the better the
result will probably be. Some of the reason this hasn't been done
already is because we're still getting the infrastructure into place,
and we're fixing and adjusting things as we go along, so while the
overall picture isn't changing much, there are bits of the design that
are still in flux as we realize, oh, crap, that was a dumb idea. As
we get a clearer idea what will be in 9.6, it will get easier to
present the overall picture in a coherent way.
So reviewing patch 13 isn't possible without prior knowledge.
The basic question for patch 13 is whether ephemeral record types can
occur in executor tuples in any contexts that I haven't identified. I
know that a tuple table slot can contain a column that is of type
record or record[], and those records can themselves contain
attributes of type record or record[], and so on as far down as you
like. I *think* that's the only case. For example, I don't believe
that a TupleTableSlot can contain a *named* record type that has an
anonymous record buried down inside of it somehow. But I'm not
positive I'm right about that.
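To be concrete about what "buried down inside" means: a composite datum
carries its own type OID and typmod in its header, so the only way to find
anonymous records is to walk the value itself, recursing into composite
attributes. A simplified sketch (hypothetical helper; the code in patch 13
also has to look through arrays and range types, which this ignores):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "fmgr.h"
#include "utils/lsyscache.h"
#include "utils/typcache.h"

/*
 * Sketch only: does this composite datum, or any composite value nested
 * inside it, have the anonymous RECORD row type?
 */
static bool
composite_contains_transient_record(Datum value)
{
    HeapTupleHeader tup = DatumGetHeapTupleHeader(value);
    Oid         typid = HeapTupleHeaderGetTypeId(tup);
    int32       typmod = HeapTupleHeaderGetTypMod(tup);
    TupleDesc   tupdesc;
    HeapTupleData htup;
    Datum      *values;
    bool       *isnull;
    bool        result = false;
    int         i;

    if (typid == RECORDOID)
        return true;            /* anonymous record right here */

    /* Named composite type: deform it and check each attribute. */
    tupdesc = lookup_rowtype_tupdesc(typid, typmod);
    values = (Datum *) palloc(tupdesc->natts * sizeof(Datum));
    isnull = (bool *) palloc(tupdesc->natts * sizeof(bool));

    ItemPointerSetInvalid(&htup.t_self);
    htup.t_tableOid = InvalidOid;
    htup.t_len = HeapTupleHeaderGetDatumLength(tup);
    htup.t_data = tup;
    heap_deform_tuple(&htup, tupdesc, values, isnull);

    for (i = 0; i < tupdesc->natts; i++)
    {
        if (!isnull[i] &&
            type_is_rowtype(tupdesc->attrs[i]->atttypid) &&
            composite_contains_transient_record(values[i]))
        {
            result = true;
            break;
        }
    }

    ReleaseTupleDesc(tupdesc);
    return result;
}

The open question is whether the executor can ever hand tqueue.c a value
for which that walk would find an anonymous record underneath a *named*
top-level type.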
Hoping we'll be able to find some time on this at PGConf.eu; thanks for
coming over.
Sure thing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 17, 2015 at 06:17:37PM -0400, Robert Haas wrote:
One idea that I think would provide
*excellent* test coverage is to take the test code included on this
thread and run it on the buildfarm. The idea of the code is to
basically run the regression test suite with every parallel-eligible
query forced to unnecessarily use parallelism. Turning that on and
running 'make check' found, directly or indirectly, all of these bugs.
Doing that on the whole buildfarm would probably find more.
However, I'm pretty sure that we don't want to switch the *entire*
buildfarm to using lots of unnecessary parallelism. What we might be
able to do is have some critters that people spin up for this precise
purpose. Just like we currently have CLOBBER_CACHE_ALWAYS buildfarm
members, we could have GRATUITOUSLY_PARALLEL buildfarm members. If
Andrew is willing to add buildfarm support for that option and a few
What, if anything, would this mode require beyond adding a #define? If
nothing, it won't require specific support in the buildfarm script.
CLOBBER_CACHE_ALWAYS has no specific support.
people are willing to run critters in that mode, I will be happy -
more than happy, really - to put the test code into committable form,
guarded by a #define, and away we go.
I would make one such animal.
* Noah Misch (noah@leadboat.com) wrote:
On Sat, Oct 17, 2015 at 06:17:37PM -0400, Robert Haas wrote:
people are willing to run critters in that mode, I will be happy -
more than happy, really - to put the test code into committable form,
guarded by a #define, and away we go.
I would make one such animal.
We're also looking at what animals it makes sense to run as part of
pginfra and I expect we'd be able to include an animal for these tests
also (though Stefan is the one really driving that effort).
Thanks!
Stephen
On 10/17/2015 06:17 PM, Robert Haas wrote:
However, I'm pretty sure that we don't want to switch the *entire*
buildfarm to using lots of unnecessary parallelism. What we might be
able to do is have some critters that people spin up for this precise
purpose. Just like we currently have CLOBBER_CACHE_ALWAYS buildfarm
members, we could have GRATUITOUSLY_PARALLEL buildfarm members. If
Andrew is willing to add buildfarm support for that option and a few
people are willing to run critters in that mode, I will be happy -
more than happy, really - to put the test code into committable form,
guarded by a #define, and away we go.
If all that is required is a #define, like CLOBBER_CACHE_ALWAYS, then no
special buildfarm support is required - you would just add that to the
animal's config file, more or less like this:
config_env =>
{
CPPFLAGS => '-DGRATUITOUSLY_PARALLEL',
},
I try to make things easy :-)
cheers
andrew
On Sat, Oct 17, 2015 at 9:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
If all that is required is a #define, like CLOBBER_CACHE_ALWAYS, then no
special buildfarm support is required - you would just add that to the
animal's config file, more or less like this:
config_env =>
{
CPPFLAGS => '-DGRATUITOUSLY_PARALLEL',
},
I try to make things easy :-)
Wow, that's great. So, I'll try to rework the test code I posted
previously into something less hacky, and eventually add a #define
like this so we can run it on the buildfarm. There are a few other
things that need to get done before that really makes sense - like
getting the rest of the bug fix patches committed - otherwise any
buildfarm critters we add will just be permanently red.
Thanks to Noah and Stephen for your replies also - it is good to hear
that if I spend the time to make this committable, somebody will use
it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 17, 2015 at 6:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation. Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular. Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.
Here is a patch to add a hopefully-useful file header comment to
execParallel.c. I included one for nodeGather.c as well, which seems
to be contrary to previous practice, but actually it seems like
previous practice is not the greatest: surely it's not self-evident
what all of the executor nodes do.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
parallel-exec-header-comments.patch (application/x-patch)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 3bb8206..d99e170 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -6,6 +6,14 @@
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
+ * This file contains routines that are intended to support setting up,
+ * using, and tearing down a ParallelContext from within the PostgreSQL
+ * executor. The ParallelContext machinery will handle starting the
+ * workers and ensuring that their state generally matches that of the
+ * leader; see src/backend/access/transam/README.parallel for details.
+ * However, we must save and restore relevant executor state, such as
+ * any ParamListInfo associated witih the query, buffer usage info, and
+ * the actual plan to be passed down to the worker.
*
* IDENTIFICATION
* src/backend/executor/execParallel.c
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 7e2272f..017adf2 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -6,6 +6,20 @@
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
+ * A Gather executor launches parallel workers to run multiple copies of a
+ * plan. It can also run the plan itself, if the workers are not available
+ * or have not started up yet. It then merges all of the results it produces
+ * and the results from the workers into a single output stream. Therefore,
+ * it will normally be used with a plan where running multiple copies of the
+ * same plan does not produce duplicate output, such as PartialSeqScan.
+ *
+ * Alternatively, a Gather node can be configured to use just one worker
+ * and the single-copy flag can be set. In this case, the Gather node will
+ * run the plan in one worker and will not execute the plan itself. In
+ * this case, it simply returns whatever tuples were returned by the worker.
+ * If a worker cannot be obtained, then it will run the plan itself and
+ * return the results. Therefore, a plan used with a single-copy Gather
+ * node not be parallel-aware.
*
* IDENTIFICATION
* src/backend/executor/nodeGather.c
On 17 October 2015 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote:
It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation. Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular. Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.
Not on your case in a big way, just noting the need for change there.
I'll help as well, but if you could start with enough basics to allow me to
ask questions that will help. Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 20, 2015 at 8:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Oct 17, 2015 at 6:17 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation. Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular. Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.
Here is a patch to add a hopefully-useful file header comment to
execParallel.c. I included one for nodeGather.c as well, which seems
to be contrary to previous practice, but actually it seems like
previous practice is not the greatest: surely it's not self-evident
what all of the executor nodes do.
+ * any ParamListInfo associated witih the query, buffer usage info, and
+ * the actual plan to be passed down to the worker.
typo 'witih'.
+ * return the results. Therefore, a plan used with a single-copy Gather
+ * node not be parallel-aware.
"node not" seems to be incomplete.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wednesday, 21 October 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 20, 2015 at 8:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Oct 17, 2015 at 6:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation. Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular. Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.
Here is a patch to add a hopefully-useful file header comment to
execParallel.c. I included one for nodeGather.c as well, which seems
to be contrary to previous practice, but actually it seems like
previous practice is not the greatest: surely it's not self-evident
what all of the executor nodes do.
+ * any ParamListInfo associated witih the query, buffer usage info, and
+ * the actual plan to be passed down to the worker.
typo 'witih'.
+ * return the results. Therefore, a plan used with a single-copy Gather
+ * node not be parallel-aware.
"node not" seems to be incomplete.
... node *need* not be parallel aware?
Thanks,
Amit
On Wed, Oct 21, 2015 at 9:04 AM, Amit Langote <amitlangote09@gmail.com> wrote:
... node *need* not be parallel aware?
Yes, thanks. Committed that way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 20, 2015 at 6:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Not on your case in a big way, just noting the need for change there.
Yes, I appreciate your attitude. I think we are on the same wavelength.
I'll help as well, but if you could start with enough basics to allow me to
ask questions that will help. Thanks.
Will try to keep pushing in that direction. May be easier once some
of the dust has settled.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Oct 18, 2015 at 12:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
So reviewing patch 13 isn't possible without prior knowledge.
The basic question for patch 13 is whether ephemeral record types can
occur in executor tuples in any contexts that I haven't identified. I
know that a tuple table slot can contain a column that is of type
record or record[], and those records can themselves contain
attributes of type record or record[], and so on as far down as you
like. I *think* that's the only case. For example, I don't believe
that a TupleTableSlot can contain a *named* record type that has an
anonymous record buried down inside of it somehow. But I'm not
positive I'm right about that.
I have done some more testing and investigation and determined that
this optimism was unwarranted. It turns out that the type information
for composite and record types gets stored in two different places.
First, the TupleTableSlot has a type OID, indicating the sort of
value it expects to be stored in that slot attribute. Second, the
value itself contains a type OID and typmod. And these don't have to
match. For example, consider this query:
select row_to_json(i) from int8_tbl i(x,y);
Without i(x,y), the HeapTuple passed to row_to_json is labelled with
the pg_type OID of int8_tbl. But with the query as written, it's
labeled as an anonymous record type. If I jigger things by hacking
the code so that this is planned as Gather (single-copy) -> SeqScan,
with row_to_json evaluated at the Gather node, then the sequential
scan kicks out a tuple with a transient record type and stores it into
a slot whose type OID is still that of int8_tbl. My previous patch
failed to deal with that; the attached one does.
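To show the mismatch directly, the two places the type information lives
can be read side by side. This is a sketch only; it assumes the slot has
already been deformed and that the column is non-null:

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "fmgr.h"

/*
 * Sketch only: compare the type declared by the slot's descriptor with
 * the type actually stamped on the composite datum stored in that column.
 */
static void
show_composite_type_mismatch(TupleTableSlot *slot, int attno)
{
    Form_pg_attribute att = slot->tts_tupleDescriptor->attrs[attno];
    HeapTupleHeader th = DatumGetHeapTupleHeader(slot->tts_values[attno]);

    elog(DEBUG1, "slot declares type %u, datum is stamped %u with typmod %d",
         att->atttypid,
         HeapTupleHeaderGetTypeId(th),
         HeapTupleHeaderGetTypMod(th));
}

In the jiggered plan described above, the declared side still expects
int8_tbl's row type while the value itself arrives stamped as an anonymous
record with a backend-local typmod.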
The previous patch was also defective in a few other respects. The
most significant of those, maybe, is that it somehow thought it was OK
to assume that transient typmods from all workers could be treated
interchangeably rather than individually. To fix this, I've changed
the TupleQueueFunnel implemented by tqueue.c to be merely a
TupleQueueReader which handles reading from a single worker only.
nodeGather.c therefore creates one TupleQueueReader per worker instead
of a single TupleQueueFunnel for all workers; accordingly, the logic
for multiplexing multiple queues now lives in nodeGather.c. This is
probably how I should have done it originally - someone, I think Jeff
Davis, complained previously that tqueue.c had no business embedding
the round-robin policy decision, and he was right. So this addresses
that complaint as well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
tqueue-record-types-v2.patch (application/x-patch)
From db5b2a90ec35adf3f5fac72483679ebcefdb29af Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2015 12:43:22 -0400
Subject: [PATCH 7/8] Modify tqueue infrastructure to support transient record
types.
Commit 4a4e6893aa080b9094dadbe0e65f8a75fee41ac6, which introduced this
mechanism, failed to account for the fact that the RECORD pseudo-type
uses transient typmods that are only meaningful within a single
backend. Transferring such tuples without modification between two
cooperating backends does not work. This commit installs a system
for passing the tuple descriptors over the same shm_mq being used to
send the tuples themselves. The two sides might not assign the same
transient typmod to any given tuple descriptor, so we must also
substitute the appropriate receiver-side typmod for the one used by
the sender. That adds some CPU overhead, but still seems better than
being unable to pass records between cooperating parallel processes.
Along the way, move the logic for handling multiple tuple queues from
tqueue.c to nodeGather.c; tqueue.c now provides a TupleQueueReader,
which reads from a single queue, rather than a TupleQueueFunnel, which
potentially reads from multiple queues. This change was suggested
previously as a way to make sure that nodeGather.c rather than tqueue.c
had policy control over the order in which to read from queues, but
it wasn't clear to me until now how good an idea it was. typmod
mapping needs to be performed separately for each queue, and it is
much simpler if the tqueue.c code handles that and leaves multiplexing
multiple queues to higher layers of the stack.
---
src/backend/executor/nodeGather.c | 139 ++++--
src/backend/executor/tqueue.c | 977 +++++++++++++++++++++++++++++++++-----
src/include/executor/tqueue.h | 12 +-
src/include/nodes/execnodes.h | 4 +-
src/tools/pgindent/typedefs.list | 2 +-
5 files changed, 980 insertions(+), 154 deletions(-)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 9c1533e..312302a 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -36,11 +36,13 @@
#include "executor/nodeGather.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "miscadmin.h"
#include "utils/memutils.h"
#include "utils/rel.h"
static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+static HeapTuple gather_readnext(GatherState *gatherstate);
/* ----------------------------------------------------------------
@@ -124,6 +126,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
TupleTableSlot *
ExecGather(GatherState *node)
{
+ TupleTableSlot *fslot = node->funnel_slot;
int i;
TupleTableSlot *slot;
TupleTableSlot *resultSlot;
@@ -147,6 +150,7 @@ ExecGather(GatherState *node)
*/
if (gather->num_workers > 0 && IsInParallelMode())
{
+ ParallelContext *pcxt;
bool got_any_worker = false;
/* Initialize the workers required to execute Gather node. */
@@ -158,18 +162,26 @@ ExecGather(GatherState *node)
* Register backend workers. We might not get as many as we
* requested, or indeed any at all.
*/
- LaunchParallelWorkers(node->pei->pcxt);
+ pcxt = node->pei->pcxt;
+ LaunchParallelWorkers(pcxt);
- /* Set up a tuple queue to collect the results. */
- node->funnel = CreateTupleQueueFunnel();
- for (i = 0; i < node->pei->pcxt->nworkers; ++i)
+ /* Set up tuple queue readers to read the results. */
+ if (pcxt->nworkers > 0)
{
- if (node->pei->pcxt->worker[i].bgwhandle)
+ node->nreaders = 0;
+ node->reader =
+ palloc(pcxt->nworkers * sizeof(TupleQueueReader *));
+
+ for (i = 0; i < pcxt->nworkers; ++i)
{
+ if (pcxt->worker[i].bgwhandle == NULL)
+ continue;
+
shm_mq_set_handle(node->pei->tqueue[i],
- node->pei->pcxt->worker[i].bgwhandle);
- RegisterTupleQueueOnFunnel(node->funnel,
- node->pei->tqueue[i]);
+ pcxt->worker[i].bgwhandle);
+ node->reader[node->nreaders++] =
+ CreateTupleQueueReader(node->pei->tqueue[i],
+ fslot->tts_tupleDescriptor);
got_any_worker = true;
}
}
@@ -180,7 +192,7 @@ ExecGather(GatherState *node)
}
/* Run plan locally if no workers or not single-copy. */
- node->need_to_scan_locally = (node->funnel == NULL)
+ node->need_to_scan_locally = (node->reader == NULL)
|| !gather->single_copy;
node->initialized = true;
}
@@ -252,13 +264,9 @@ ExecEndGather(GatherState *node)
}
/*
- * gather_getnext
- *
- * Get the next tuple from shared memory queue. This function
- * is responsible for fetching tuples from all the queues associated
- * with worker backends used in Gather node execution and if there is
- * no data available from queues or no worker is available, it does
- * fetch the data from local node.
+ * Read the next tuple. We might fetch a tuple from one of the tuple queues
+ * using gather_readnext, or if no tuple queue contains a tuple and the
+ * single_copy flag is not set, we might generate one locally instead.
*/
static TupleTableSlot *
gather_getnext(GatherState *gatherstate)
@@ -268,19 +276,11 @@ gather_getnext(GatherState *gatherstate)
TupleTableSlot *fslot = gatherstate->funnel_slot;
HeapTuple tup;
- while (gatherstate->funnel != NULL || gatherstate->need_to_scan_locally)
+ while (gatherstate->reader != NULL || gatherstate->need_to_scan_locally)
{
- if (gatherstate->funnel != NULL)
+ if (gatherstate->reader != NULL)
{
- bool done = false;
-
- /* wait only if local scan is done */
- tup = TupleQueueFunnelNext(gatherstate->funnel,
- gatherstate->need_to_scan_locally,
- &done);
- if (done)
- ExecShutdownGather(gatherstate);
-
+ tup = gather_readnext(gatherstate);
if (HeapTupleIsValid(tup))
{
ExecStoreTuple(tup, /* tuple to store */
@@ -307,6 +307,80 @@ gather_getnext(GatherState *gatherstate)
return ExecClearTuple(fslot);
}
+/*
+ * Attempt to read a tuple from one of our parallel workers.
+ */
+static HeapTuple
+gather_readnext(GatherState *gatherstate)
+{
+ int waitpos = gatherstate->nextreader;
+
+ for (;;)
+ {
+ TupleQueueReader *reader;
+ HeapTuple tup;
+ bool readerdone;
+
+ /* Make sure we've read all messages from workers. */
+ HandleParallelMessages();
+
+ /* Attempt to read a tuple, but don't block if none is available. */
+ reader = gatherstate->reader[gatherstate->nextreader];
+ tup = TupleQueueReaderNext(reader, true, &readerdone);
+
+ /*
+ * If this reader is done, remove it. If all readers are done,
+ * clean up remaining worker state.
+ */
+ if (readerdone)
+ {
+ DestroyTupleQueueReader(reader);
+ --gatherstate->nreaders;
+ if (gatherstate->nreaders == 0)
+ {
+ ExecShutdownGather(gatherstate);
+ return NULL;
+ }
+ else
+ {
+ memmove(&gatherstate->reader[gatherstate->nextreader],
+ &gatherstate->reader[gatherstate->nextreader + 1],
+ sizeof(TupleQueueReader *)
+ * (gatherstate->nreaders - gatherstate->nextreader));
+ if (gatherstate->nextreader >= gatherstate->nreaders)
+ gatherstate->nextreader = 0;
+ if (gatherstate->nextreader < waitpos)
+ --waitpos;
+ }
+ continue;
+ }
+
+ /* Advance nextreader pointer in round-robin fashion. */
+ gatherstate->nextreader =
+ (gatherstate->nextreader + 1) % gatherstate->nreaders;
+
+ /* If we got a tuple, return it. */
+ if (tup)
+ return tup;
+
+ /* Have we visited every TupleQueueReader? */
+ if (gatherstate->nextreader == waitpos)
+ {
+ /*
+ * If (still) running plan locally, return NULL so caller can
+ * generate another tuple from the local copy of the plan.
+ */
+ if (gatherstate->need_to_scan_locally)
+ return NULL;
+
+ /* Nothing to do except wait for developments. */
+ WaitLatch(MyLatch, WL_LATCH_SET, 0);
+ CHECK_FOR_INTERRUPTS();
+ ResetLatch(MyLatch);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecShutdownGather
*
@@ -318,11 +392,14 @@ gather_getnext(GatherState *gatherstate)
void
ExecShutdownGather(GatherState *node)
{
- /* Shut down tuple queue funnel before shutting down workers. */
- if (node->funnel != NULL)
+ /* Shut down tuple queue readers before shutting down workers. */
+ if (node->reader != NULL)
{
- DestroyTupleQueueFunnel(node->funnel);
- node->funnel = NULL;
+ int i;
+
+ for (i = 0; i < node->nreaders; ++i)
+ DestroyTupleQueueReader(node->reader[i]);
+ node->reader = NULL;
}
/* Now shut down the workers. */
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 67143d3..1b326e8 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -4,10 +4,15 @@
* Use shm_mq to send & receive tuples between parallel backends
*
* A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
- * under the hood, writes tuples from the executor to a shm_mq.
+ * under the hood, writes tuples from the executor to a shm_mq. If
+ * necessary, it also writes control messages describing transient
+ * record types used within the tuple.
*
- * A TupleQueueFunnel helps manage the process of reading tuples from
- * one or more shm_mq objects being used as tuple queues.
+ * A TupleQueueReader reads tuples, and if any are sent control messages,
+ * from a shm_mq and returns the tuples. If transient record types are
+ * in use, it registers those types based on the received control messages
+ * and rewrites the typemods sent by the remote side to the corresponding
+ * local record typemods.
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -21,37 +26,404 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "catalog/pg_type.h"
#include "executor/tqueue.h"
+#include "funcapi.h"
+#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rangetypes.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+typedef enum
+{
+ TQUEUE_REMAP_NONE, /* no special processing required */
+ TQUEUE_REMAP_ARRAY, /* array */
+ TQUEUE_REMAP_RANGE, /* range */
+ TQUEUE_REMAP_RECORD /* composite type, named or anonymous */
+} RemapClass;
+
+typedef struct
+{
+ int natts;
+ RemapClass mapping[FLEXIBLE_ARRAY_MEMBER];
+} RemapInfo;
typedef struct
{
DestReceiver pub;
shm_mq_handle *handle;
+ MemoryContext tmpcontext;
+ HTAB *recordhtab;
+ char mode;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
} TQueueDestReceiver;
-struct TupleQueueFunnel
+typedef struct RecordTypemodMap
{
- int nqueues;
- int maxqueues;
- int nextqueue;
- shm_mq_handle **queue;
+ int remotetypmod;
+ int localtypmod;
+} RecordTypemodMap;
+
+struct TupleQueueReader
+{
+ shm_mq_handle *queue;
+ char mode;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+ HTAB *typmodmap;
};
+#define TUPLE_QUEUE_MODE_CONTROL 'c'
+#define TUPLE_QUEUE_MODE_DATA 'd'
+
+static void tqueueWalk(TQueueDestReceiver * tqueue, RemapClass walktype,
+ Datum value);
+static void tqueueWalkRecord(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueWalkArray(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueWalkRange(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueSendTypmodInfo(TQueueDestReceiver * tqueue, int typmod,
+ TupleDesc tupledesc);
+static void TupleQueueHandleControlMessage(TupleQueueReader *reader,
+ Size nbytes, char *data);
+static HeapTuple TupleQueueHandleDataMessage(TupleQueueReader *reader,
+ Size nbytes, HeapTupleHeader data);
+static HeapTuple TupleQueueRemapTuple(TupleQueueReader *reader,
+ TupleDesc tupledesc, RemapInfo * remapinfo,
+ HeapTuple tuple);
+static Datum TupleQueueRemap(TupleQueueReader *reader, RemapClass remapclass,
+ Datum value);
+static Datum TupleQueueRemapArray(TupleQueueReader *reader, Datum value);
+static Datum TupleQueueRemapRange(TupleQueueReader *reader, Datum value);
+static Datum TupleQueueRemapRecord(TupleQueueReader *reader, Datum value);
+static RemapClass GetRemapClass(Oid typeid);
+static RemapInfo *BuildRemapInfo(TupleDesc tupledesc);
+
/*
* Receive a tuple.
+ *
+ * This is, at core, pretty simple: just send the tuple to the designated
+ * shm_mq. The complicated part is that if the tuple contains transient
+ * record types (see lookup_rowtype_tupdesc), we need to send control
+ * information to the shm_mq receiver so that those typemods can be correctly
+ * interpreted, as they are merely held in a backend-local cache. Worse, the
+ * record type may not be at the top level: we could have a range over an array
+ * type over a range type over a range type over an array type over a record,
+ * or something like that.
*/
static void
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+ TupleDesc tupledesc = slot->tts_tupleDescriptor;
HeapTuple tuple;
+ HeapTupleHeader tup;
+
+ /*
+ * Test to see whether the tupledesc has changed; if so, set up for the
+ * new tupledesc. This is a strange test both because the executor really
+ * shouldn't change the tupledesc, and also because it would be unsafe if
+ * the old tupledesc could be freed and a new one allocated at the same
+ * address. But since some very old code in printtup.c uses this test, we
+ * adopt it here as well.
+ */
+ if (tqueue->tupledesc != tupledesc ||
+ tqueue->remapinfo->natts != tupledesc->natts)
+ {
+ if (tqueue->remapinfo != NULL)
+ pfree(tqueue->remapinfo);
+ tqueue->remapinfo = BuildRemapInfo(tupledesc);
+ }
tuple = ExecMaterializeSlot(slot);
+ tup = tuple->t_data;
+
+ /*
+ * When, because of the types being transmitted, no record typemod mapping
+ * can be needed, we can skip a good deal of work.
+ */
+ if (tqueue->remapinfo != NULL)
+ {
+ RemapInfo *remapinfo = tqueue->remapinfo;
+ AttrNumber i;
+ MemoryContext oldcontext = NULL;
+
+ /* Deform the tuple so we can examine it, if not done already. */
+ slot_getallattrs(slot);
+
+ /* Iterate over each attribute and search it for transient typemods. */
+ Assert(slot->tts_tupleDescriptor->natts == remapinfo->natts);
+ for (i = 0; i < remapinfo->natts; ++i)
+ {
+ /* Ignore nulls and types that don't need special handling. */
+ if (slot->tts_isnull[i] ||
+ remapinfo->mapping[i] == TQUEUE_REMAP_NONE)
+ continue;
+
+ /* Switch to temporary memory context to avoid leaking. */
+ if (oldcontext == NULL)
+ {
+ if (tqueue->tmpcontext == NULL)
+ tqueue->tmpcontext =
+ AllocSetContextCreate(TopMemoryContext,
+ "tqueue temporary context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ oldcontext = MemoryContextSwitchTo(tqueue->tmpcontext);
+ }
+
+ /* Invoke the appropriate walker function. */
+ tqueueWalk(tqueue, remapinfo->mapping[i], slot->tts_values[i]);
+ }
+
+ /* If we used the temp context, reset it and restore prior context. */
+ if (oldcontext != NULL)
+ {
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextReset(tqueue->tmpcontext);
+ }
+
+ /* If we entered control mode, switch back to data mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_DATA)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_DATA;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+ }
+
+ /* Send the tuple itself. */
shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
}
/*
+ * Invoke the appropriate walker function based on the given RemapClass.
+ */
+static void
+tqueueWalk(TQueueDestReceiver * tqueue, RemapClass walktype, Datum value)
+{
+ check_stack_depth();
+
+ switch (walktype)
+ {
+ case TQUEUE_REMAP_NONE:
+ break;
+ case TQUEUE_REMAP_ARRAY:
+ tqueueWalkArray(tqueue, value);
+ break;
+ case TQUEUE_REMAP_RANGE:
+ tqueueWalkRange(tqueue, value);
+ break;
+ case TQUEUE_REMAP_RECORD:
+ tqueueWalkRecord(tqueue, value);
+ break;
+ }
+}
+
+/*
+ * Walk a record and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkRecord(TQueueDestReceiver * tqueue, Datum value)
+{
+ HeapTupleHeader tup;
+ Oid typeid;
+ Oid typmod;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+
+ /* Extract typmod from tuple. */
+ tup = DatumGetHeapTupleHeader(value);
+ typeid = HeapTupleHeaderGetTypeId(tup);
+ typmod = HeapTupleHeaderGetTypMod(tup);
+
+ /* Look up tuple descriptor in typecache. */
+ tupledesc = lookup_rowtype_tupdesc(typeid, typmod);
+
+ /*
+ * If this is a transient record type, send its TupleDesc as a control
+ * message. (tqueueSendTypemodInfo is smart enough to do this only once
+ * per typmod.)
+ */
+ if (typeid == RECORDOID)
+ tqueueSendTypmodInfo(tqueue, typmod, tupledesc);
+
+ /*
+ * Build the remap information for this tupledesc. We might want to think
+ * about keeping a cache of this information keyed by typeid and typemod,
+ * but let's keep it simple for now.
+ */
+ remapinfo = BuildRemapInfo(tupledesc);
+
+ /*
+ * If remapping is required, deform the tuple and process each field. When
+ * BuildRemapInfo is null, the data types are such that there can be no
+ * transient record types here, so we can skip all this work.
+ */
+ if (remapinfo != NULL)
+ {
+ Datum *values;
+ bool *isnull;
+ HeapTupleData tdata;
+ AttrNumber i;
+
+ /* Deform the tuple so we can check each column within. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ tdata.t_len = HeapTupleHeaderGetDatumLength(tup);
+ ItemPointerSetInvalid(&(tdata.t_self));
+ tdata.t_tableOid = InvalidOid;
+ tdata.t_data = tup;
+ heap_deform_tuple(&tdata, tupledesc, values, isnull);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
+ if (!isnull[i])
+ tqueueWalk(tqueue, remapinfo->mapping[i], values[i]);
+ }
+
+ /* Release reference count acquired by lookup_rowtype_tupdesc. */
+ DecrTupleDescRefCount(tupledesc);
+}
+
+/*
+ * Walk an array and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkArray(TQueueDestReceiver * tqueue, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Oid typeid = ARR_ELEMTYPE(arr);
+ RemapClass remapclass;
+ int16 typlen;
+ bool typbyval;
+ char typalign;
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
+
+ remapclass = GetRemapClass(typeid);
+
+ /*
+ * If the elements of the array don't need to be walked, we shouldn't have
+ * been called in the first place: GetRemapClass should have returned NULL
+ * when asked about this array type.
+ */
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Deconstruct the array. */
+ get_typlenbyvalalign(typeid, &typlen, &typbyval, &typalign);
+ deconstruct_array(arr, typeid, typlen, typbyval, typalign,
+ &elem_values, &elem_nulls, &num_elems);
+
+ /* Walk each element. */
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ tqueueWalk(tqueue, remapclass, elem_values[i]);
+}
+
+/*
+ * Walk a range type and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkRange(TQueueDestReceiver * tqueue, Datum value)
+{
+ RangeType *range = DatumGetRangeType(value);
+ Oid typeid = RangeTypeGetOid(range);
+ RemapClass remapclass;
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+
+ /*
+ * Extract the lower and upper bounds. It might be worth implementing
+ * some caching scheme here so that we don't look up the same typeids in
+ * the type cache repeatedly, but for now let's keep it simple.
+ */
+ typcache = lookup_type_cache(typeid, TYPECACHE_RANGE_INFO);
+ if (typcache->rngelemtype == NULL)
+ elog(ERROR, "type %u is not a range type", typeid);
+ range_deserialize(typcache, range, &lower, &upper, &empty);
+
+ /* Nothing to do for an empty range. */
+ if (empty)
+ return;
+
+ /*
+ * If the range bounds don't need to be walked, we shouldn't have been
+ * called in the first place: GetRemapClass should have returned NULL when
+ * asked about this range type.
+ */
+ remapclass = GetRemapClass(typeid);
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Walk each bound, if present. */
+ if (!upper.infinite)
+ tqueueWalk(tqueue, remapclass, upper.val);
+ if (!lower.infinite)
+ tqueueWalk(tqueue, remapclass, lower.val);
+}
+
+/*
+ * Send tuple descriptor information for a transient typemod, unless we've
+ * already done so previously.
+ */
+static void
+tqueueSendTypmodInfo(TQueueDestReceiver * tqueue, int typmod,
+ TupleDesc tupledesc)
+{
+ StringInfoData buf;
+ bool found;
+ AttrNumber i;
+
+ /* Initialize hash table if not done yet. */
+ if (tqueue->recordhtab == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(int);
+ ctl.hcxt = TopMemoryContext;
+ tqueue->recordhtab = hash_create("tqueue record hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
+ }
+
+ /* Have we already seen this record type? If not, must report it. */
+ hash_search(tqueue->recordhtab, &typmod, HASH_ENTER, &found);
+ if (found)
+ return;
+
+ /* If message queue is in data mode, switch to control mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_CONTROL)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_CONTROL;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+
+ /* Assemble a control message. */
+ initStringInfo(&buf);
+ appendBinaryStringInfo(&buf, (char *) &typmod, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->natts, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->tdhasoid,
+ sizeof(bool));
+ for (i = 0; i < tupledesc->natts; ++i)
+ appendBinaryStringInfo(&buf, (char *) tupledesc->attrs[i],
+ sizeof(FormData_pg_attribute));
+
+ /* Send control message. */
+ shm_mq_send(tqueue->handle, buf.len, buf.data, false);
+}
+
+/*
* Prepare to receive tuples from executor.
*/
static void
@@ -77,6 +449,14 @@ tqueueShutdownReceiver(DestReceiver *self)
static void
tqueueDestroyReceiver(DestReceiver *self)
{
+ TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+
+ if (tqueue->tmpcontext != NULL)
+ MemoryContextDelete(tqueue->tmpcontext);
+ if (tqueue->recordhtab != NULL)
+ hash_destroy(tqueue->recordhtab);
+ if (tqueue->remapinfo != NULL)
+ pfree(tqueue->remapinfo);
pfree(self);
}
@@ -96,169 +476,536 @@ CreateTupleQueueDestReceiver(shm_mq_handle *handle)
self->pub.rDestroy = tqueueDestroyReceiver;
self->pub.mydest = DestTupleQueue;
self->handle = handle;
+ self->tmpcontext = NULL;
+ self->recordhtab = NULL;
+ self->mode = TUPLE_QUEUE_MODE_DATA;
+ self->remapinfo = NULL;
return (DestReceiver *) self;
}
/*
- * Create a tuple queue funnel.
+ * Create a tuple queue reader.
*/
-TupleQueueFunnel *
-CreateTupleQueueFunnel(void)
+TupleQueueReader *
+CreateTupleQueueReader(shm_mq_handle *handle, TupleDesc tupledesc)
{
- TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+ TupleQueueReader *reader = palloc0(sizeof(TupleQueueReader));
- funnel->maxqueues = 8;
- funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+ reader->queue = handle;
+ reader->mode = TUPLE_QUEUE_MODE_DATA;
+ reader->tupledesc = tupledesc;
+ reader->remapinfo = BuildRemapInfo(tupledesc);
- return funnel;
+ return reader;
}
/*
- * Destroy a tuple queue funnel.
+ * Destroy a tuple queue reader.
*/
void
-DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+DestroyTupleQueueReader(TupleQueueReader *reader)
{
- int i;
+ shm_mq_detach(shm_mq_get_queue(reader->queue));
+ if (reader->remapinfo != NULL)
+ pfree(reader->remapinfo);
+ pfree(reader);
+}
+
+/*
+ * Fetch a tuple from a tuple queue reader.
+ *
+ * Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK, this can still
+ * accumulate bytes from a partially-read message, so it's useful to call
+ * this with nowait = true even if nothing is returned.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no tuple is ready to return. *done, if not NULL,
+ * is set to true when queue is detached and otherwise to false.
+ */
+HeapTuple
+TupleQueueReaderNext(TupleQueueReader *reader, bool nowait, bool *done)
+{
+ shm_mq_result result;
+
+ if (done != NULL)
+ *done = false;
+
+ for (;;)
+ {
+ Size nbytes;
+ void *data;
+
+ /* Attempt to read a message. */
+ result = shm_mq_receive(reader->queue, &nbytes, &data, true);
- for (i = 0; i < funnel->nqueues; i++)
- shm_mq_detach(shm_mq_get_queue(funnel->queue[i]));
- pfree(funnel->queue);
- pfree(funnel);
+ /* If queue is detached, set *done and return NULL. */
+ if (result == SHM_MQ_DETACHED)
+ {
+ if (done != NULL)
+ *done = true;
+ return NULL;
+ }
+
+ /* In non-blocking mode, bail out if no message ready yet. */
+ if (result == SHM_MQ_WOULD_BLOCK)
+ return NULL;
+ Assert(result == SHM_MQ_SUCCESS);
+
+ /*
+ * OK, we got a message. Process it.
+ *
+ * One-byte messages are mode switch messages, so that we can switch
+ * between "control" and "data" mode. When in "data" mode, each
+ * message (unless exactly one byte) is a tuple. When in "control"
+ * mode, each message provides a transient-typmod-to-tupledesc mapping
+ * so we can interpret future tuples.
+ */
+ if (nbytes == 1)
+ {
+ /* Mode switch message. */
+ reader->mode = ((char *) data)[0];
+ }
+ else if (reader->mode == TUPLE_QUEUE_MODE_DATA)
+ {
+ /* Tuple data. */
+ return TupleQueueHandleDataMessage(reader, nbytes, data);
+ }
+ else if (reader->mode == TUPLE_QUEUE_MODE_CONTROL)
+ {
+ /* Control message, describing a transient record type. */
+ TupleQueueHandleControlMessage(reader, nbytes, data);
+ }
+ else
+ elog(ERROR, "invalid mode: %d", (int) reader->mode);
+ }
}
/*
- * Remember the shared memory queue handle in funnel.
+ * Handle a data message - that is, a tuple - from the remote side.
*/
-void
-RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+static HeapTuple
+TupleQueueHandleDataMessage(TupleQueueReader *reader,
+ Size nbytes,
+ HeapTupleHeader data)
{
- if (funnel->nqueues < funnel->maxqueues)
+ HeapTupleData htup;
+
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = nbytes;
+ htup.t_data = data;
+
+ return TupleQueueRemapTuple(reader, reader->tupledesc, reader->remapinfo,
+ &htup);
+}
+
+/*
+ * Remap tuple typmods per control information received from remote side.
+ */
+static HeapTuple
+TupleQueueRemapTuple(TupleQueueReader *reader, TupleDesc tupledesc,
+ RemapInfo * remapinfo, HeapTuple tuple)
+{
+ Datum *values;
+ bool *isnull;
+ bool dirty = false;
+ int i;
+
+ /*
+ * If no remapping is necessary, just copy the tuple into a single
+ * palloc'd chunk, as caller will expect.
+ */
+ if (remapinfo == NULL)
+ return heap_copytuple(tuple);
+
+ /* Deform tuple so we can remap record typmods for individual attrs. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ heap_deform_tuple(tuple, tupledesc, values, isnull);
+ Assert(tupledesc->natts == remapinfo->natts);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
{
- funnel->queue[funnel->nqueues++] = handle;
- return;
+ if (isnull[i] || remapinfo->mapping[i] == TQUEUE_REMAP_NONE)
+ continue;
+ values[i] = TupleQueueRemap(reader, remapinfo->mapping[i], values[i]);
+ dirty = true;
}
- if (funnel->nqueues >= funnel->maxqueues)
+ /* Reform the modified tuple. */
+ return heap_form_tuple(tupledesc, values, isnull);
+}
+
+/*
+ * Remap a value based on the specified remap class.
+ */
+static Datum
+TupleQueueRemap(TupleQueueReader *reader, RemapClass remapclass, Datum value)
+{
+ check_stack_depth();
+
+ switch (remapclass)
{
- int newsize = funnel->nqueues * 2;
+ case TQUEUE_REMAP_NONE:
+ /* caller probably shouldn't have called us at all, but... */
+ return value;
+
+ case TQUEUE_REMAP_ARRAY:
+ return TupleQueueRemapArray(reader, value);
- Assert(funnel->nqueues == funnel->maxqueues);
+ case TQUEUE_REMAP_RANGE:
+ return TupleQueueRemapRange(reader, value);
- funnel->queue = repalloc(funnel->queue,
- newsize * sizeof(shm_mq_handle *));
- funnel->maxqueues = newsize;
+ case TQUEUE_REMAP_RECORD:
+ return TupleQueueRemapRecord(reader, value);
}
+}
+
+/*
+ * Remap an array.
+ */
+static Datum
+TupleQueueRemapArray(TupleQueueReader *reader, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Oid typeid = ARR_ELEMTYPE(arr);
+ RemapClass remapclass;
+ int16 typlen;
+ bool typbyval;
+ char typalign;
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
- funnel->queue[funnel->nqueues++] = handle;
+ remapclass = GetRemapClass(typeid);
+
+ /*
+ * If the elements of the array don't need to be walked, we shouldn't have
+ * been called in the first place: GetRemapClass should have returned NULL
+ * when asked about this array type.
+ */
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Deconstruct the array. */
+ get_typlenbyvalalign(typeid, &typlen, &typbyval, &typalign);
+ deconstruct_array(arr, typeid, typlen, typbyval, typalign,
+ &elem_values, &elem_nulls, &num_elems);
+
+ /* Remap each element. */
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ elem_values[i] = TupleQueueRemap(reader, remapclass,
+ elem_values[i]);
+
+ /* Reconstruct and return the array. */
+ arr = construct_md_array(elem_values, elem_nulls,
+ ARR_NDIM(arr), ARR_DIMS(arr), ARR_LBOUND(arr),
+ typeid, typlen, typbyval, typalign);
+ return PointerGetDatum(arr);
}
/*
- * Fetch a tuple from a tuple queue funnel.
- *
- * We try to read from the queues in round-robin fashion so as to avoid
- * the situation where some workers get their tuples read expediently while
- * others are barely ever serviced.
- *
- * Even when nowait = false, we read from the individual queues in
- * non-blocking mode. Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
- * it can still accumulate bytes from a partially-read message, so doing it
- * this way should outperform doing a blocking read on each queue in turn.
+ * Remap a range type.
+ */
+static Datum
+TupleQueueRemapRange(TupleQueueReader *reader, Datum value)
+{
+ RangeType *range = DatumGetRangeType(value);
+ Oid typeid = RangeTypeGetOid(range);
+ RemapClass remapclass;
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+
+ /*
+ * Extract the lower and upper bounds. As in tqueueWalkRange, some
+ * caching might be a good idea here.
+ */
+ typcache = lookup_type_cache(typeid, TYPECACHE_RANGE_INFO);
+ if (typcache->rngelemtype == NULL)
+ elog(ERROR, "type %u is not a range type", typeid);
+ range_deserialize(typcache, range, &lower, &upper, &empty);
+
+ /* Nothing to do for an empty range. */
+ if (empty)
+ return value;
+
+ /*
+ * If the range bounds don't need to be walked, we shouldn't have been
+ * called in the first place: GetRemapClass should have returned NULL when
+ * asked about this range type.
+ */
+ remapclass = GetRemapClass(typeid);
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Remap each bound, if present. */
+ if (!upper.infinite)
+ upper.val = TupleQueueRemap(reader, remapclass, upper.val);
+ if (!lower.infinite)
+ lower.val = TupleQueueRemap(reader, remapclass, lower.val);
+
+ /* And reserialize. */
+ range = range_serialize(typcache, &lower, &upper, empty);
+ return RangeTypeGetDatum(range);
+}
+
+/*
+ * Remap a record.
+ */
+static Datum
+TupleQueueRemapRecord(TupleQueueReader *reader, Datum value)
+{
+ HeapTupleHeader tup;
+ Oid typeid;
+ int typmod;
+ RecordTypemodMap *mapent;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+ HeapTupleData htup;
+ HeapTuple atup;
+
+ /* Fetch type OID and typemod. */
+ tup = DatumGetHeapTupleHeader(value);
+ typeid = HeapTupleHeaderGetTypeId(tup);
+ typmod = HeapTupleHeaderGetTypMod(tup);
+
+ /* If transient record, replace remote typmod with local typmod. */
+ if (typeid == RECORDOID)
+ {
+ Assert(reader->typmodmap != NULL);
+ mapent = hash_search(reader->typmodmap, &typmod,
+ HASH_FIND, NULL);
+ if (mapent == NULL)
+ elog(ERROR, "found unrecognized remote typmod %d", typmod);
+ typmod = mapent->localtypmod;
+ }
+
+ /*
+ * Fetch tupledesc and compute remap info. We should probably cache this
+ * so that we don't have to keep recomputing it.
+ */
+ tupledesc = lookup_rowtype_tupdesc(typeid, typmod);
+ remapinfo = BuildRemapInfo(tupledesc);
+ DecrTupleDescRefCount(tupledesc);
+
+ /* Remap tuple. */
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = HeapTupleHeaderGetDatumLength(tup);
+ htup.t_data = tup;
+ atup = TupleQueueRemapTuple(reader, tupledesc, remapinfo, &htup);
+ HeapTupleHeaderSetTypeId(atup->t_data, typeid);
+ HeapTupleHeaderSetTypMod(atup->t_data, typmod);
+ HeapTupleHeaderSetDatumLength(atup->t_data, htup.t_len);
+
+ /* And return the results. */
+ return HeapTupleHeaderGetDatum(atup->t_data);
+}
+
+/*
+ * Handle a control message from the tuple queue reader.
*
- * The return value is NULL if there are no remaining queues or if
- * nowait = true and no queue returned a tuple without blocking. *done, if
- * not NULL, is set to true when there are no remaining queues and false in
- * any other case.
+ * Control messages are sent when the remote side is sending tuples that
+ * contain transient record types. We need to arrange to bless those
+ * record types locally and translate between remote and local typmods.
*/
-HeapTuple
-TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+static void
+TupleQueueHandleControlMessage(TupleQueueReader *reader, Size nbytes,
+ char *data)
{
- int waitpos = funnel->nextqueue;
+ int natts;
+ int remotetypmod;
+ bool hasoid;
+ char *buf = data;
+ int rc = 0;
+ int i;
+ Form_pg_attribute *attrs;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ RecordTypemodMap *mapent;
+ bool found;
+
+ /* Extract remote typmod. */
+ memcpy(&remotetypmod, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract attribute count. */
+ memcpy(&natts, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract hasoid flag. */
+ memcpy(&hasoid, &buf[rc], sizeof(bool));
+ rc += sizeof(bool);
+
+ /* Extract attribute details. */
+ oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+ attrs = palloc(natts * sizeof(Form_pg_attribute));
+ for (i = 0; i < natts; ++i)
+ {
+ attrs[i] = palloc(sizeof(FormData_pg_attribute));
+ memcpy(attrs[i], &buf[rc], sizeof(FormData_pg_attribute));
+ rc += sizeof(FormData_pg_attribute);
+ }
+ MemoryContextSwitchTo(oldcontext);
+
+ /* We should have read the whole message. */
+ Assert(rc == nbytes);
+
+ /* Construct TupleDesc. */
+ tupledesc = CreateTupleDesc(natts, hasoid, attrs);
+ tupledesc = BlessTupleDesc(tupledesc);
- /* Corner case: called before adding any queues, or after all are gone. */
- if (funnel->nqueues == 0)
+ /* Create map if it doesn't exist already. */
+ if (reader->typmodmap == NULL)
{
- if (done != NULL)
- *done = true;
- return NULL;
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(RecordTypemodMap);
+ ctl.hcxt = CurTransactionContext;
+ reader->typmodmap = hash_create("typmodmap hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
}
- if (done != NULL)
- *done = false;
+ /* Create map entry. */
+ mapent = hash_search(reader->typmodmap, &remotetypmod, HASH_ENTER,
+ &found);
+ if (found)
+ elog(ERROR, "duplicate message for typmod %d",
+ remotetypmod);
+ mapent->localtypmod = tupledesc->tdtypmod;
+ elog(DEBUG3, "mapping remote typmod %d to local typmod %d",
+ remotetypmod, tupledesc->tdtypmod);
+}
- for (;;)
+/*
+ * Build a mapping indicating what remapping class applies to each attribute
+ * described by a tupledesc.
+ */
+static RemapInfo *
+BuildRemapInfo(TupleDesc tupledesc)
+{
+ RemapInfo *remapinfo;
+ Size size;
+ AttrNumber i;
+ bool noop = true;
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ size = offsetof(RemapInfo, mapping) +
+ sizeof(RemapClass) * tupledesc->natts;
+ remapinfo = MemoryContextAllocZero(TopMemoryContext, size);
+ remapinfo->natts = tupledesc->natts;
+ for (i = 0; i < tupledesc->natts; ++i)
{
- shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
- shm_mq_result result;
- Size nbytes;
- void *data;
+ Form_pg_attribute attr = tupledesc->attrs[i];
- /* Attempt to read a message. */
- result = shm_mq_receive(mqh, &nbytes, &data, true);
+ remapinfo->mapping[i] = GetRemapClass(attr->atttypid);
+ if (remapinfo->mapping[i] != TQUEUE_REMAP_NONE)
+ noop = false;
+ }
- /*
- * Normally, we advance funnel->nextqueue to the next queue at this
- * point, but if we're pointing to a queue that we've just discovered
- * is detached, then forget that queue and leave the pointer where it
- * is until the number of remaining queues fall below that pointer and
- * at that point make the pointer point to the first queue.
- */
- if (result != SHM_MQ_DETACHED)
- funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
- else
- {
- --funnel->nqueues;
- if (funnel->nqueues == 0)
- {
- if (done != NULL)
- *done = true;
- return NULL;
- }
+ if (noop)
+ {
+ appendStringInfo(&buf, "noop");
+ pfree(remapinfo);
+ remapinfo = NULL;
+ }
- memmove(&funnel->queue[funnel->nextqueue],
- &funnel->queue[funnel->nextqueue + 1],
- sizeof(shm_mq_handle *)
- * (funnel->nqueues - funnel->nextqueue));
+ return remapinfo;
+}
- if (funnel->nextqueue >= funnel->nqueues)
- funnel->nextqueue = 0;
+/*
+ * Determine the remap class associated with a particular data type.
+ *
+ * Transient record types need to have the typmod applied on the sending side
+ * replaced with a value on the receiving side that has the same meaning.
+ *
+ * Arrays, range types, and all record types (including named composite types)
+ * need to be searched for transient record values buried within them.
+ * Surprisingly, a walker is required even when the indicated type is a
+ * composite type, because the actual value may be a compatible transient
+ * record type.
+ */
+static RemapClass
+GetRemapClass(Oid typeid)
+{
+ RemapClass forceResult = TQUEUE_REMAP_NONE;
+ RemapClass innerResult = TQUEUE_REMAP_NONE;
- if (funnel->nextqueue < waitpos)
- --waitpos;
+ for (;;)
+ {
+ HeapTuple tup;
+ Form_pg_type typ;
+ /* Simple cases. */
+ if (typeid == RECORDOID)
+ {
+ innerResult = TQUEUE_REMAP_RECORD;
+ break;
+ }
+ if (typeid == RECORDARRAYOID)
+ {
+ innerResult = TQUEUE_REMAP_ARRAY;
+ break;
+ }
+
+ /* Otherwise, we need a syscache lookup to figure it out. */
+ tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typeid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for type %u", typeid);
+ typ = (Form_pg_type) GETSTRUCT(tup);
+
+ /* Look through domains to underlying base type. */
+ if (typ->typtype == TYPTYPE_DOMAIN)
+ {
+ typeid = typ->typbasetype;
+ ReleaseSysCache(tup);
continue;
}
- /* If we got a message, return it. */
- if (result == SHM_MQ_SUCCESS)
+ /*
+ * Look through arrays to underlying base type, but the final return
+ * value must be either TQUEUE_REMAP_ARRAY or TQUEUE_REMAP_NONE. (If
+ * this is an array of integers, for example, we don't need to walk
+ * it.)
+ */
+ if (OidIsValid(typ->typelem) && typ->typlen == -1)
{
- HeapTupleData htup;
-
- /*
- * The tuple data we just read from the queue is only valid until
- * we again attempt to read from it. Copy the tuple into a single
- * palloc'd chunk as callers will expect.
- */
- ItemPointerSetInvalid(&htup.t_self);
- htup.t_tableOid = InvalidOid;
- htup.t_len = nbytes;
- htup.t_data = data;
- return heap_copytuple(&htup);
+ typeid = typ->typelem;
+ ReleaseSysCache(tup);
+ if (forceResult == TQUEUE_REMAP_NONE)
+ forceResult = TQUEUE_REMAP_ARRAY;
+ continue;
}
/*
- * If we've visited all of the queues, then we should either give up
- * and return NULL (if we're in non-blocking mode) or wait for the
- * process latch to be set (otherwise).
+ * Similarly, look through ranges to the underlying base type, but the
+ * final return value must be either TQUEUE_REMAP_RANGE or
+ * TQUEUE_REMAP_NONE.
*/
- if (funnel->nextqueue == waitpos)
+ if (typ->typtype == TYPTYPE_RANGE)
{
- if (nowait)
- return NULL;
- WaitLatch(MyLatch, WL_LATCH_SET, 0);
- CHECK_FOR_INTERRUPTS();
- ResetLatch(MyLatch);
+ ReleaseSysCache(tup);
+ if (forceResult == TQUEUE_REMAP_NONE)
+ forceResult = TQUEUE_REMAP_RANGE;
+ typeid = get_range_subtype(typeid);
+ continue;
}
+
+ /* Walk composite types. Nothing else needs special handling. */
+ if (typ->typtype == TYPTYPE_COMPOSITE)
+ innerResult = TQUEUE_REMAP_RECORD;
+ ReleaseSysCache(tup);
+ break;
}
+
+ if (innerResult != TQUEUE_REMAP_NONE && forceResult != TQUEUE_REMAP_NONE)
+ return forceResult;
+ return innerResult;
}
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
index 6f8eb73..6a668fa 100644
--- a/src/include/executor/tqueue.h
+++ b/src/include/executor/tqueue.h
@@ -21,11 +21,11 @@
extern DestReceiver *CreateTupleQueueDestReceiver(shm_mq_handle *handle);
/* Use these to receive tuples from a shm_mq. */
-typedef struct TupleQueueFunnel TupleQueueFunnel;
-extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
-extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
-extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
-extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
- bool *done);
+typedef struct TupleQueueReader TupleQueueReader;
+extern TupleQueueReader *CreateTupleQueueReader(shm_mq_handle *handle,
+ TupleDesc tupledesc);
+extern void DestroyTupleQueueReader(TupleQueueReader *funnel);
+extern HeapTuple TupleQueueReaderNext(TupleQueueReader *,
+ bool nowait, bool *done);
#endif /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 939bc0e..58ec889 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1963,7 +1963,9 @@ typedef struct GatherState
PlanState ps; /* its first field is NodeTag */
bool initialized;
struct ParallelExecutorInfo *pei;
- struct TupleQueueFunnel *funnel;
+ int nreaders;
+ int nextreader;
+ struct TupleQueueReader **reader;
TupleTableSlot *funnel_slot;
bool need_to_scan_locally;
} GatherState;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..03e1d2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2018,7 +2018,7 @@ TupleHashEntry
TupleHashEntryData
TupleHashIterator
TupleHashTable
-TupleQueueFunnel
+TupleQueueReader
TupleTableSlot
Tuplesortstate
Tuplestorestate
--
2.3.8 (Apple Git-58)
On Wed, Oct 28, 2015 at 10:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Oct 18, 2015 at 12:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
So reviewing patch 13 isn't possible without prior knowledge.
The basic question for patch 13 is whether ephemeral record types can
occur in executor tuples in any contexts that I haven't identified. I
know that a tuple table slot can have a column that is of type
record or record[], and those records can themselves contain
attributes of type record or record[], and so on as far down as you
like. I *think* that's the only case. For example, I don't believe
that a TupleTableSlot can contain a *named* record type that has an
anonymous record buried down inside of it somehow. But I'm not
positive I'm right about that.
I have done some more testing and investigation and determined that
this optimism was unwarranted. It turns out that the type information
for composite and record types gets stored in two different places.
First, the TupleTableSlot has a type OID, indicating the sort of the
value it expects to be stored for that slot attribute. Second, the
value itself contains a type OID and typmod. And these don't have to
match. For example, consider this query:
select row_to_json(i) from int8_tbl i(x,y);
Without i(x,y), the HeapTuple passed to row_to_json is labelled with
the pg_type OID of int8_tbl. But with the query as written, it's
labeled as an anonymous record type. If I jigger things by hacking
the code so that this is planned as Gather (single-copy) -> SeqScan,
with row_to_json evaluated at the Gather node, then the sequential
scan kicks out a tuple with a transient record type and stores it into
a slot whose type OID is still that of int8_tbl. My previous patch
failed to deal with that; the attached one does.
The previous patch was also defective in a few other respects. The
most significant of those, maybe, is that it somehow thought it was OK
to assume that transient typmods from all workers could be treated
interchangeably rather than individually. To fix this, I've changed
the TupleQueueFunnel implemented by tqueue.c to be merely a
TupleQueueReader which handles reading from a single worker only.
nodeGather.c therefore creates one TupleQueueReader per worker instead
of a single TupleQueueFunnel for all workers; accordingly, the logic
for multiplexing multiple queues now lives in nodeGather.c. This is
probably how I should have done it originally - someone, I think Jeff
Davis - complained previously that tqueue.c had no business embedding
the round-robin policy decision, and he was right. So this addresses
that complaint as well.
Here is an updated version. This is rebased over recent commits, and
I added a missing check for attisdropped.
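To make the mismatch concrete, here is a minimal sketch, assuming the
int8_tbl table from the regression database (columns q1 and q2), of how the
alias list changes the labeling of the whole-row value:

select row_to_json(i) from int8_tbl i;       -- keys "q1"/"q2"; value carries int8_tbl's type OID
select row_to_json(i) from int8_tbl i(x,y);  -- keys "x"/"y"; value carries an anonymous record typmod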
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
tqueue-record-types-v3.patch (text/x-diff; charset=US-ASCII)
From fa31300a884cc942d22c66d6a30fa4c2fcba3c6f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 7 Oct 2015 12:43:22 -0400
Subject: [PATCH 5/5] Modify tqueue infrastructure to support transient record
types.
Commit 4a4e6893aa080b9094dadbe0e65f8a75fee41ac6, which introduced this
mechanism, failed to account for the fact that the RECORD pseudo-type
uses transient typmods that are only meaningful within a single
backend. Transferring such tuples without modification between two
cooperating backends does not work. This commit installs a system
for passing the tuple descriptors over the same shm_mq being used to
send the tuples themselves. The two sides might not assign the same
transient typmod to any given tuple descriptor, so we must also
substitute the appropriate receiver-side typmod for the one used by
the sender. That adds some CPU overhead, but still seems better than
being unable to pass records between cooperating parallel processes.
Along the way, move the logic for handling multiple tuple queues from
tqueue.c to nodeGather.c; tqueue.c now provides a TupleQueueReader,
which reads from a single queue, rather than a TupleQueueFunnel, which
potentially reads from multiple queues. This change was suggested
previously as a way to make sure that nodeGather.c rather than tqueue.c
had policy control over the order in which to read from queues, but
it wasn't clear to me until now how good an idea it was. typmod
mapping needs to be performed separately for each queue, and it is
much simpler if the tqueue.c code handles that and leaves multiplexing
multiple queues to higher layers of the stack.
---
src/backend/executor/nodeGather.c | 138 ++++--
src/backend/executor/tqueue.c | 983 +++++++++++++++++++++++++++++++++-----
src/include/executor/tqueue.h | 12 +-
src/include/nodes/execnodes.h | 4 +-
src/tools/pgindent/typedefs.list | 2 +-
5 files changed, 986 insertions(+), 153 deletions(-)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 5f58961..850c67e 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -36,11 +36,13 @@
#include "executor/nodeGather.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "miscadmin.h"
#include "utils/memutils.h"
#include "utils/rel.h"
static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+static HeapTuple gather_readnext(GatherState *gatherstate);
static void ExecShutdownGatherWorkers(GatherState *node);
@@ -125,6 +127,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
TupleTableSlot *
ExecGather(GatherState *node)
{
+ TupleTableSlot *fslot = node->funnel_slot;
int i;
TupleTableSlot *slot;
TupleTableSlot *resultSlot;
@@ -148,6 +151,7 @@ ExecGather(GatherState *node)
*/
if (gather->num_workers > 0 && IsInParallelMode())
{
+ ParallelContext *pcxt;
bool got_any_worker = false;
/* Initialize the workers required to execute Gather node. */
@@ -160,18 +164,26 @@ ExecGather(GatherState *node)
* Register backend workers. We might not get as many as we
* requested, or indeed any at all.
*/
- LaunchParallelWorkers(node->pei->pcxt);
+ pcxt = node->pei->pcxt;
+ LaunchParallelWorkers(pcxt);
- /* Set up a tuple queue to collect the results. */
- node->funnel = CreateTupleQueueFunnel();
- for (i = 0; i < node->pei->pcxt->nworkers; ++i)
+ /* Set up tuple queue readers to read the results. */
+ if (pcxt->nworkers > 0)
{
- if (node->pei->pcxt->worker[i].bgwhandle)
+ node->nreaders = 0;
+ node->reader =
+ palloc(pcxt->nworkers * sizeof(TupleQueueReader *));
+
+ for (i = 0; i < pcxt->nworkers; ++i)
{
+ if (pcxt->worker[i].bgwhandle == NULL)
+ continue;
+
shm_mq_set_handle(node->pei->tqueue[i],
- node->pei->pcxt->worker[i].bgwhandle);
- RegisterTupleQueueOnFunnel(node->funnel,
- node->pei->tqueue[i]);
+ pcxt->worker[i].bgwhandle);
+ node->reader[node->nreaders++] =
+ CreateTupleQueueReader(node->pei->tqueue[i],
+ fslot->tts_tupleDescriptor);
got_any_worker = true;
}
}
@@ -182,7 +194,7 @@ ExecGather(GatherState *node)
}
/* Run plan locally if no workers or not single-copy. */
- node->need_to_scan_locally = (node->funnel == NULL)
+ node->need_to_scan_locally = (node->reader == NULL)
|| !gather->single_copy;
node->initialized = true;
}
@@ -254,13 +266,9 @@ ExecEndGather(GatherState *node)
}
/*
- * gather_getnext
- *
- * Get the next tuple from shared memory queue. This function
- * is responsible for fetching tuples from all the queues associated
- * with worker backends used in Gather node execution and if there is
- * no data available from queues or no worker is available, it does
- * fetch the data from local node.
+ * Read the next tuple. We might fetch a tuple from one of the tuple queues
+ * using gather_readnext, or if no tuple queue contains a tuple and the
+ * single_copy flag is not set, we might generate one locally instead.
*/
static TupleTableSlot *
gather_getnext(GatherState *gatherstate)
@@ -270,18 +278,11 @@ gather_getnext(GatherState *gatherstate)
TupleTableSlot *fslot = gatherstate->funnel_slot;
HeapTuple tup;
- while (gatherstate->funnel != NULL || gatherstate->need_to_scan_locally)
+ while (gatherstate->reader != NULL || gatherstate->need_to_scan_locally)
{
- if (gatherstate->funnel != NULL)
+ if (gatherstate->reader != NULL)
{
- bool done = false;
-
- /* wait only if local scan is done */
- tup = TupleQueueFunnelNext(gatherstate->funnel,
- gatherstate->need_to_scan_locally,
- &done);
- if (done)
- ExecShutdownGatherWorkers(gatherstate);
+ tup = gather_readnext(gatherstate);
if (HeapTupleIsValid(tup))
{
@@ -309,6 +310,80 @@ gather_getnext(GatherState *gatherstate)
return ExecClearTuple(fslot);
}
+/*
+ * Attempt to read a tuple from one of our parallel workers.
+ */
+static HeapTuple
+gather_readnext(GatherState *gatherstate)
+{
+ int waitpos = gatherstate->nextreader;
+
+ for (;;)
+ {
+ TupleQueueReader *reader;
+ HeapTuple tup;
+ bool readerdone;
+
+ /* Make sure we've read all messages from workers. */
+ HandleParallelMessages();
+
+ /* Attempt to read a tuple, but don't block if none is available. */
+ reader = gatherstate->reader[gatherstate->nextreader];
+ tup = TupleQueueReaderNext(reader, true, &readerdone);
+
+ /*
+ * If this reader is done, remove it. If all readers are done,
+ * clean up remaining worker state.
+ */
+ if (readerdone)
+ {
+ DestroyTupleQueueReader(reader);
+ --gatherstate->nreaders;
+ if (gatherstate->nreaders == 0)
+ {
+ ExecShutdownGather(gatherstate);
+ return NULL;
+ }
+ else
+ {
+ memmove(&gatherstate->reader[gatherstate->nextreader],
+ &gatherstate->reader[gatherstate->nextreader + 1],
+ sizeof(TupleQueueReader *)
+ * (gatherstate->nreaders - gatherstate->nextreader));
+ if (gatherstate->nextreader >= gatherstate->nreaders)
+ gatherstate->nextreader = 0;
+ if (gatherstate->nextreader < waitpos)
+ --waitpos;
+ }
+ continue;
+ }
+
+ /* Advance nextreader pointer in round-robin fashion. */
+ gatherstate->nextreader =
+ (gatherstate->nextreader + 1) % gatherstate->nreaders;
+
+ /* If we got a tuple, return it. */
+ if (tup)
+ return tup;
+
+ /* Have we visited every TupleQueueReader? */
+ if (gatherstate->nextreader == waitpos)
+ {
+ /*
+ * If (still) running plan locally, return NULL so caller can
+ * generate another tuple from the local copy of the plan.
+ */
+ if (gatherstate->need_to_scan_locally)
+ return NULL;
+
+ /* Nothing to do except wait for developments. */
+ WaitLatch(MyLatch, WL_LATCH_SET, 0);
+ CHECK_FOR_INTERRUPTS();
+ ResetLatch(MyLatch);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecShutdownGatherWorkers
*
@@ -320,11 +395,14 @@ gather_getnext(GatherState *gatherstate)
void
ExecShutdownGatherWorkers(GatherState *node)
{
- /* Shut down tuple queue funnel before shutting down workers. */
- if (node->funnel != NULL)
+ /* Shut down tuple queue readers before shutting down workers. */
+ if (node->reader != NULL)
{
- DestroyTupleQueueFunnel(node->funnel);
- node->funnel = NULL;
+ int i;
+
+ for (i = 0; i < node->nreaders; ++i)
+ DestroyTupleQueueReader(node->reader[i]);
+ node->reader = NULL;
}
/* Now shut down the workers. */
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 67143d3..f465b1d 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -4,10 +4,15 @@
* Use shm_mq to send & receive tuples between parallel backends
*
* A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
- * under the hood, writes tuples from the executor to a shm_mq.
+ * under the hood, writes tuples from the executor to a shm_mq. If
+ * necessary, it also writes control messages describing transient
+ * record types used within the tuple.
*
- * A TupleQueueFunnel helps manage the process of reading tuples from
- * one or more shm_mq objects being used as tuple queues.
+ * A TupleQueueReader reads tuples, and control messages if any are sent,
+ * from a shm_mq and returns the tuples. If transient record types are
+ * in use, it registers those types based on the received control messages
+ * and rewrites the typemods sent by the remote side to the corresponding
+ * local record typemods.
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -21,37 +26,404 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "catalog/pg_type.h"
#include "executor/tqueue.h"
+#include "funcapi.h"
+#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rangetypes.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+typedef enum
+{
+ TQUEUE_REMAP_NONE, /* no special processing required */
+ TQUEUE_REMAP_ARRAY, /* array */
+ TQUEUE_REMAP_RANGE, /* range */
+ TQUEUE_REMAP_RECORD /* composite type, named or anonymous */
+} RemapClass;
+
+typedef struct
+{
+ int natts;
+ RemapClass mapping[FLEXIBLE_ARRAY_MEMBER];
+} RemapInfo;
typedef struct
{
DestReceiver pub;
shm_mq_handle *handle;
+ MemoryContext tmpcontext;
+ HTAB *recordhtab;
+ char mode;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
} TQueueDestReceiver;
-struct TupleQueueFunnel
+typedef struct RecordTypemodMap
{
- int nqueues;
- int maxqueues;
- int nextqueue;
- shm_mq_handle **queue;
+ int remotetypmod;
+ int localtypmod;
+} RecordTypemodMap;
+
+struct TupleQueueReader
+{
+ shm_mq_handle *queue;
+ char mode;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+ HTAB *typmodmap;
};
+#define TUPLE_QUEUE_MODE_CONTROL 'c'
+#define TUPLE_QUEUE_MODE_DATA 'd'
+
+static void tqueueWalk(TQueueDestReceiver * tqueue, RemapClass walktype,
+ Datum value);
+static void tqueueWalkRecord(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueWalkArray(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueWalkRange(TQueueDestReceiver * tqueue, Datum value);
+static void tqueueSendTypmodInfo(TQueueDestReceiver * tqueue, int typmod,
+ TupleDesc tupledesc);
+static void TupleQueueHandleControlMessage(TupleQueueReader *reader,
+ Size nbytes, char *data);
+static HeapTuple TupleQueueHandleDataMessage(TupleQueueReader *reader,
+ Size nbytes, HeapTupleHeader data);
+static HeapTuple TupleQueueRemapTuple(TupleQueueReader *reader,
+ TupleDesc tupledesc, RemapInfo * remapinfo,
+ HeapTuple tuple);
+static Datum TupleQueueRemap(TupleQueueReader *reader, RemapClass remapclass,
+ Datum value);
+static Datum TupleQueueRemapArray(TupleQueueReader *reader, Datum value);
+static Datum TupleQueueRemapRange(TupleQueueReader *reader, Datum value);
+static Datum TupleQueueRemapRecord(TupleQueueReader *reader, Datum value);
+static RemapClass GetRemapClass(Oid typeid);
+static RemapInfo *BuildRemapInfo(TupleDesc tupledesc);
+
/*
* Receive a tuple.
+ *
+ * This is, at core, pretty simple: just send the tuple to the designated
+ * shm_mq. The complicated part is that if the tuple contains transient
+ * record types (see lookup_rowtype_tupdesc), we need to send control
+ * information to the shm_mq receiver so that those typemods can be correctly
+ * interpreted, as they are merely held in a backend-local cache. Worse, the
+ * record type may not be at the top level: we could have a range over an array
+ * type over a range type over a range type over an array type over a record,
+ * or something like that.
*/
static void
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+ TupleDesc tupledesc = slot->tts_tupleDescriptor;
HeapTuple tuple;
+ HeapTupleHeader tup;
+
+ /*
+ * Test to see whether the tupledesc has changed; if so, set up for the
+ * new tupledesc. This is a strange test both because the executor really
+ * shouldn't change the tupledesc, and also because it would be unsafe if
+ * the old tupledesc could be freed and a new one allocated at the same
+ * address. But since some very old code in printtup.c uses this test, we
+ * adopt it here as well.
+ */
+ if (tqueue->tupledesc != tupledesc ||
+ tqueue->remapinfo->natts != tupledesc->natts)
+ {
+ if (tqueue->remapinfo != NULL)
+ pfree(tqueue->remapinfo);
+ tqueue->remapinfo = BuildRemapInfo(tupledesc);
+ }
tuple = ExecMaterializeSlot(slot);
+ tup = tuple->t_data;
+
+ /*
+ * When, because of the types being transmitted, no record typemod mapping
+ * can be needed, we can skip a good deal of work.
+ */
+ if (tqueue->remapinfo != NULL)
+ {
+ RemapInfo *remapinfo = tqueue->remapinfo;
+ AttrNumber i;
+ MemoryContext oldcontext = NULL;
+
+ /* Deform the tuple so we can examine it, if not done already. */
+ slot_getallattrs(slot);
+
+ /* Iterate over each attribute and search it for transient typemods. */
+ Assert(slot->tts_tupleDescriptor->natts == remapinfo->natts);
+ for (i = 0; i < remapinfo->natts; ++i)
+ {
+ /* Ignore nulls and types that don't need special handling. */
+ if (slot->tts_isnull[i] ||
+ remapinfo->mapping[i] == TQUEUE_REMAP_NONE)
+ continue;
+
+ /* Switch to temporary memory context to avoid leaking. */
+ if (oldcontext == NULL)
+ {
+ if (tqueue->tmpcontext == NULL)
+ tqueue->tmpcontext =
+ AllocSetContextCreate(TopMemoryContext,
+ "tqueue temporary context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ oldcontext = MemoryContextSwitchTo(tqueue->tmpcontext);
+ }
+
+ /* Invoke the appropriate walker function. */
+ tqueueWalk(tqueue, remapinfo->mapping[i], slot->tts_values[i]);
+ }
+
+ /* If we used the temp context, reset it and restore prior context. */
+ if (oldcontext != NULL)
+ {
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextReset(tqueue->tmpcontext);
+ }
+
+ /* If we entered control mode, switch back to data mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_DATA)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_DATA;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+ }
+
+ /* Send the tuple itself. */
shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
}
/*
+ * Invoke the appropriate walker function based on the given RemapClass.
+ */
+static void
+tqueueWalk(TQueueDestReceiver * tqueue, RemapClass walktype, Datum value)
+{
+ check_stack_depth();
+
+ switch (walktype)
+ {
+ case TQUEUE_REMAP_NONE:
+ break;
+ case TQUEUE_REMAP_ARRAY:
+ tqueueWalkArray(tqueue, value);
+ break;
+ case TQUEUE_REMAP_RANGE:
+ tqueueWalkRange(tqueue, value);
+ break;
+ case TQUEUE_REMAP_RECORD:
+ tqueueWalkRecord(tqueue, value);
+ break;
+ }
+}
+
+/*
+ * Walk a record and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkRecord(TQueueDestReceiver * tqueue, Datum value)
+{
+ HeapTupleHeader tup;
+ Oid typeid;
+ Oid typmod;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+
+ /* Extract typmod from tuple. */
+ tup = DatumGetHeapTupleHeader(value);
+ typeid = HeapTupleHeaderGetTypeId(tup);
+ typmod = HeapTupleHeaderGetTypMod(tup);
+
+ /* Look up tuple descriptor in typecache. */
+ tupledesc = lookup_rowtype_tupdesc(typeid, typmod);
+
+ /*
+ * If this is a transient record type, send its TupleDesc as a control
+ * message. (tqueueSendTypemodInfo is smart enough to do this only once
+ * per typmod.)
+ */
+ if (typeid == RECORDOID)
+ tqueueSendTypmodInfo(tqueue, typmod, tupledesc);
+
+ /*
+ * Build the remap information for this tupledesc. We might want to think
+ * about keeping a cache of this information keyed by typeid and typemod,
+ * but let's keep it simple for now.
+ */
+ remapinfo = BuildRemapInfo(tupledesc);
+
+ /*
+ * If remapping is required, deform the tuple and process each field. When
+ * BuildRemapInfo returns NULL, the data types are such that there can be no
+ * transient record types here, so we can skip all this work.
+ */
+ if (remapinfo != NULL)
+ {
+ Datum *values;
+ bool *isnull;
+ HeapTupleData tdata;
+ AttrNumber i;
+
+ /* Deform the tuple so we can check each column within. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ tdata.t_len = HeapTupleHeaderGetDatumLength(tup);
+ ItemPointerSetInvalid(&(tdata.t_self));
+ tdata.t_tableOid = InvalidOid;
+ tdata.t_data = tup;
+ heap_deform_tuple(&tdata, tupledesc, values, isnull);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
+ if (!isnull[i])
+ tqueueWalk(tqueue, remapinfo->mapping[i], values[i]);
+ }
+
+ /* Release reference count acquired by lookup_rowtype_tupdesc. */
+ DecrTupleDescRefCount(tupledesc);
+}
+
+/*
+ * Walk an array and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkArray(TQueueDestReceiver * tqueue, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Oid typeid = ARR_ELEMTYPE(arr);
+ RemapClass remapclass;
+ int16 typlen;
+ bool typbyval;
+ char typalign;
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
+
+ remapclass = GetRemapClass(typeid);
+
+ /*
+ * If the elements of the array don't need to be walked, we shouldn't have
+ * been called in the first place: GetRemapClass should have returned NULL
+ * when asked about this array type.
+ */
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Deconstruct the array. */
+ get_typlenbyvalalign(typeid, &typlen, &typbyval, &typalign);
+ deconstruct_array(arr, typeid, typlen, typbyval, typalign,
+ &elem_values, &elem_nulls, &num_elems);
+
+ /* Walk each element. */
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ tqueueWalk(tqueue, remapclass, elem_values[i]);
+}
+
+/*
+ * Walk a range type and send control messages for transient record types
+ * contained therein.
+ */
+static void
+tqueueWalkRange(TQueueDestReceiver * tqueue, Datum value)
+{
+ RangeType *range = DatumGetRangeType(value);
+ Oid typeid = RangeTypeGetOid(range);
+ RemapClass remapclass;
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+
+ /*
+ * Extract the lower and upper bounds. It might be worth implementing
+ * some caching scheme here so that we don't look up the same typeids in
+ * the type cache repeatedly, but for now let's keep it simple.
+ */
+ typcache = lookup_type_cache(typeid, TYPECACHE_RANGE_INFO);
+ if (typcache->rngelemtype == NULL)
+ elog(ERROR, "type %u is not a range type", typeid);
+ range_deserialize(typcache, range, &lower, &upper, &empty);
+
+ /* Nothing to do for an empty range. */
+ if (empty)
+ return;
+
+ /*
+ * If the range bounds don't need to be walked, we shouldn't have been
+ * called in the first place: GetRemapClass should have returned NULL when
+ * asked about this range type.
+ */
+ remapclass = GetRemapClass(typeid);
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Walk each bound, if present. */
+ if (!upper.infinite)
+ tqueueWalk(tqueue, remapclass, upper.val);
+ if (!lower.infinite)
+ tqueueWalk(tqueue, remapclass, lower.val);
+}
+
+/*
+ * Send tuple descriptor information for a transient typemod, unless we've
+ * already done so previously.
+ */
+static void
+tqueueSendTypmodInfo(TQueueDestReceiver * tqueue, int typmod,
+ TupleDesc tupledesc)
+{
+ StringInfoData buf;
+ bool found;
+ AttrNumber i;
+
+ /* Initialize hash table if not done yet. */
+ if (tqueue->recordhtab == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(int);
+ ctl.hcxt = TopMemoryContext;
+ tqueue->recordhtab = hash_create("tqueue record hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
+ }
+
+ /* Have we already seen this record type? If not, must report it. */
+ hash_search(tqueue->recordhtab, &typmod, HASH_ENTER, &found);
+ if (found)
+ return;
+
+ /* If message queue is in data mode, switch to control mode. */
+ if (tqueue->mode != TUPLE_QUEUE_MODE_CONTROL)
+ {
+ tqueue->mode = TUPLE_QUEUE_MODE_CONTROL;
+ shm_mq_send(tqueue->handle, sizeof(char), &tqueue->mode, false);
+ }
+
+ /* Assemble a control message. */
+ initStringInfo(&buf);
+ appendBinaryStringInfo(&buf, (char *) &typmod, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->natts, sizeof(int));
+ appendBinaryStringInfo(&buf, (char *) &tupledesc->tdhasoid,
+ sizeof(bool));
+ for (i = 0; i < tupledesc->natts; ++i)
+ appendBinaryStringInfo(&buf, (char *) tupledesc->attrs[i],
+ sizeof(FormData_pg_attribute));
+
+ /* Send control message. */
+ shm_mq_send(tqueue->handle, buf.len, buf.data, false);
+}
+
+/*
* Prepare to receive tuples from executor.
*/
static void
@@ -77,6 +449,14 @@ tqueueShutdownReceiver(DestReceiver *self)
static void
tqueueDestroyReceiver(DestReceiver *self)
{
+ TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+
+ if (tqueue->tmpcontext != NULL)
+ MemoryContextDelete(tqueue->tmpcontext);
+ if (tqueue->recordhtab != NULL)
+ hash_destroy(tqueue->recordhtab);
+ if (tqueue->remapinfo != NULL)
+ pfree(tqueue->remapinfo);
pfree(self);
}
@@ -96,169 +476,542 @@ CreateTupleQueueDestReceiver(shm_mq_handle *handle)
self->pub.rDestroy = tqueueDestroyReceiver;
self->pub.mydest = DestTupleQueue;
self->handle = handle;
+ self->tmpcontext = NULL;
+ self->recordhtab = NULL;
+ self->mode = TUPLE_QUEUE_MODE_DATA;
+ self->remapinfo = NULL;
return (DestReceiver *) self;
}
/*
- * Create a tuple queue funnel.
+ * Create a tuple queue reader.
*/
-TupleQueueFunnel *
-CreateTupleQueueFunnel(void)
+TupleQueueReader *
+CreateTupleQueueReader(shm_mq_handle *handle, TupleDesc tupledesc)
{
- TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+ TupleQueueReader *reader = palloc0(sizeof(TupleQueueReader));
- funnel->maxqueues = 8;
- funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+ reader->queue = handle;
+ reader->mode = TUPLE_QUEUE_MODE_DATA;
+ reader->tupledesc = tupledesc;
+ reader->remapinfo = BuildRemapInfo(tupledesc);
- return funnel;
+ return reader;
}
/*
- * Destroy a tuple queue funnel.
+ * Destroy a tuple queue reader.
*/
void
-DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+DestroyTupleQueueReader(TupleQueueReader *reader)
{
- int i;
+ shm_mq_detach(shm_mq_get_queue(reader->queue));
+ if (reader->remapinfo != NULL)
+ pfree(reader->remapinfo);
+ pfree(reader);
+}
+
+/*
+ * Fetch a tuple from a tuple queue reader.
+ *
+ * Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK, this can still
+ * accumulate bytes from a partially-read message, so it's useful to call
+ * this with nowait = true even if nothing is returned.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no tuple is ready to return. *done, if not NULL,
+ * is set to true when queue is detached and otherwise to false.
+ */
+HeapTuple
+TupleQueueReaderNext(TupleQueueReader *reader, bool nowait, bool *done)
+{
+ shm_mq_result result;
+
+ if (done != NULL)
+ *done = false;
+
+ for (;;)
+ {
+ Size nbytes;
+ void *data;
+
+ /* Attempt to read a message. */
+ result = shm_mq_receive(reader->queue, &nbytes, &data, true);
+
+ /* If queue is detached, set *done and return NULL. */
+ if (result == SHM_MQ_DETACHED)
+ {
+ if (done != NULL)
+ *done = true;
+ return NULL;
+ }
+
+ /* In non-blocking mode, bail out if no message ready yet. */
+ if (result == SHM_MQ_WOULD_BLOCK)
+ return NULL;
+ Assert(result == SHM_MQ_SUCCESS);
- for (i = 0; i < funnel->nqueues; i++)
- shm_mq_detach(shm_mq_get_queue(funnel->queue[i]));
- pfree(funnel->queue);
- pfree(funnel);
+ /*
+ * OK, we got a message. Process it.
+ *
+ * One-byte messages are mode switch messages, so that we can switch
+ * between "control" and "data" mode. When in "data" mode, each
+ * message (unless exactly one byte) is a tuple. When in "control"
+ * mode, each message provides a transient-typmod-to-tupledesc mapping
+ * so we can interpret future tuples.
+ */
+ if (nbytes == 1)
+ {
+ /* Mode switch message. */
+ reader->mode = ((char *) data)[0];
+ }
+ else if (reader->mode == TUPLE_QUEUE_MODE_DATA)
+ {
+ /* Tuple data. */
+ return TupleQueueHandleDataMessage(reader, nbytes, data);
+ }
+ else if (reader->mode == TUPLE_QUEUE_MODE_CONTROL)
+ {
+ /* Control message, describing a transient record type. */
+ TupleQueueHandleControlMessage(reader, nbytes, data);
+ }
+ else
+ elog(ERROR, "invalid mode: %d", (int) reader->mode);
+ }
}
/*
- * Remember the shared memory queue handle in funnel.
+ * Handle a data message - that is, a tuple - from the remote side.
*/
-void
-RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+static HeapTuple
+TupleQueueHandleDataMessage(TupleQueueReader *reader,
+ Size nbytes,
+ HeapTupleHeader data)
{
- if (funnel->nqueues < funnel->maxqueues)
+ HeapTupleData htup;
+
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = nbytes;
+ htup.t_data = data;
+
+ return TupleQueueRemapTuple(reader, reader->tupledesc, reader->remapinfo,
+ &htup);
+}
+
+/*
+ * Remap tuple typmods per control information received from remote side.
+ */
+static HeapTuple
+TupleQueueRemapTuple(TupleQueueReader *reader, TupleDesc tupledesc,
+ RemapInfo * remapinfo, HeapTuple tuple)
+{
+ Datum *values;
+ bool *isnull;
+ bool dirty = false;
+ int i;
+
+ /*
+ * If no remapping is necessary, just copy the tuple into a single
+ * palloc'd chunk, as caller will expect.
+ */
+ if (remapinfo == NULL)
+ return heap_copytuple(tuple);
+
+ /* Deform tuple so we can remap record typmods for individual attrs. */
+ values = palloc(tupledesc->natts * sizeof(Datum));
+ isnull = palloc(tupledesc->natts * sizeof(bool));
+ heap_deform_tuple(tuple, tupledesc, values, isnull);
+ Assert(tupledesc->natts == remapinfo->natts);
+
+ /* Recursively check each non-NULL attribute. */
+ for (i = 0; i < tupledesc->natts; ++i)
{
- funnel->queue[funnel->nqueues++] = handle;
- return;
+ if (isnull[i] || remapinfo->mapping[i] == TQUEUE_REMAP_NONE)
+ continue;
+ values[i] = TupleQueueRemap(reader, remapinfo->mapping[i], values[i]);
+ dirty = true;
}
- if (funnel->nqueues >= funnel->maxqueues)
+ /* Reform the modified tuple. */
+ return heap_form_tuple(tupledesc, values, isnull);
+}
+
+/*
+ * Remap a value based on the specified remap class.
+ */
+static Datum
+TupleQueueRemap(TupleQueueReader *reader, RemapClass remapclass, Datum value)
+{
+ check_stack_depth();
+
+ switch (remapclass)
{
- int newsize = funnel->nqueues * 2;
+ case TQUEUE_REMAP_NONE:
+ /* caller probably shouldn't have called us at all, but... */
+ return value;
+
+ case TQUEUE_REMAP_ARRAY:
+ return TupleQueueRemapArray(reader, value);
- Assert(funnel->nqueues == funnel->maxqueues);
+ case TQUEUE_REMAP_RANGE:
+ return TupleQueueRemapRange(reader, value);
- funnel->queue = repalloc(funnel->queue,
- newsize * sizeof(shm_mq_handle *));
- funnel->maxqueues = newsize;
+ case TQUEUE_REMAP_RECORD:
+ return TupleQueueRemapRecord(reader, value);
}
+}
- funnel->queue[funnel->nqueues++] = handle;
+/*
+ * Remap an array.
+ */
+static Datum
+TupleQueueRemapArray(TupleQueueReader *reader, Datum value)
+{
+ ArrayType *arr = DatumGetArrayTypeP(value);
+ Oid typeid = ARR_ELEMTYPE(arr);
+ RemapClass remapclass;
+ int16 typlen;
+ bool typbyval;
+ char typalign;
+ Datum *elem_values;
+ bool *elem_nulls;
+ int num_elems;
+ int i;
+
+ remapclass = GetRemapClass(typeid);
+
+ /*
+ * If the elements of the array don't need to be walked, we shouldn't have
+ * been called in the first place: GetRemapClass should have returned NULL
+ * when asked about this array type.
+ */
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Deconstruct the array. */
+ get_typlenbyvalalign(typeid, &typlen, &typbyval, &typalign);
+ deconstruct_array(arr, typeid, typlen, typbyval, typalign,
+ &elem_values, &elem_nulls, &num_elems);
+
+ /* Remap each element. */
+ for (i = 0; i < num_elems; ++i)
+ if (!elem_nulls[i])
+ elem_values[i] = TupleQueueRemap(reader, remapclass,
+ elem_values[i]);
+
+ /* Reconstruct and return the array. */
+ arr = construct_md_array(elem_values, elem_nulls,
+ ARR_NDIM(arr), ARR_DIMS(arr), ARR_LBOUND(arr),
+ typeid, typlen, typbyval, typalign);
+ return PointerGetDatum(arr);
}
/*
- * Fetch a tuple from a tuple queue funnel.
- *
- * We try to read from the queues in round-robin fashion so as to avoid
- * the situation where some workers get their tuples read expediently while
- * others are barely ever serviced.
- *
- * Even when nowait = false, we read from the individual queues in
- * non-blocking mode. Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
- * it can still accumulate bytes from a partially-read message, so doing it
- * this way should outperform doing a blocking read on each queue in turn.
- *
- * The return value is NULL if there are no remaining queues or if
- * nowait = true and no queue returned a tuple without blocking. *done, if
- * not NULL, is set to true when there are no remaining queues and false in
- * any other case.
+ * Remap a range type.
*/
-HeapTuple
-TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+static Datum
+TupleQueueRemapRange(TupleQueueReader *reader, Datum value)
{
- int waitpos = funnel->nextqueue;
+ RangeType *range = DatumGetRangeType(value);
+ Oid typeid = RangeTypeGetOid(range);
+ RemapClass remapclass;
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+
+ /*
+ * Extract the lower and upper bounds. As in tqueueWalkRange, some
+ * caching might be a good idea here.
+ */
+ typcache = lookup_type_cache(typeid, TYPECACHE_RANGE_INFO);
+ if (typcache->rngelemtype == NULL)
+ elog(ERROR, "type %u is not a range type", typeid);
+ range_deserialize(typcache, range, &lower, &upper, &empty);
+
+ /* Nothing to do for an empty range. */
+ if (empty)
+ return value;
+
+ /*
+ * If the range bounds don't need to be walked, we shouldn't have been
+ * called in the first place: GetRemapClass should have returned NULL when
+ * asked about this range type.
+ */
+ remapclass = GetRemapClass(typeid);
+ Assert(remapclass != TQUEUE_REMAP_NONE);
+
+ /* Remap each bound, if present. */
+ if (!upper.infinite)
+ upper.val = TupleQueueRemap(reader, remapclass, upper.val);
+ if (!lower.infinite)
+ lower.val = TupleQueueRemap(reader, remapclass, lower.val);
+
+ /* And reserialize. */
+ range = range_serialize(typcache, &lower, &upper, empty);
+ return RangeTypeGetDatum(range);
+}
- /* Corner case: called before adding any queues, or after all are gone. */
- if (funnel->nqueues == 0)
+/*
+ * Remap a record.
+ */
+static Datum
+TupleQueueRemapRecord(TupleQueueReader *reader, Datum value)
+{
+ HeapTupleHeader tup;
+ Oid typeid;
+ int typmod;
+ RecordTypemodMap *mapent;
+ TupleDesc tupledesc;
+ RemapInfo *remapinfo;
+ HeapTupleData htup;
+ HeapTuple atup;
+
+ /* Fetch type OID and typemod. */
+ tup = DatumGetHeapTupleHeader(value);
+ typeid = HeapTupleHeaderGetTypeId(tup);
+ typmod = HeapTupleHeaderGetTypMod(tup);
+
+ /* If transient record, replace remote typmod with local typmod. */
+ if (typeid == RECORDOID)
{
- if (done != NULL)
- *done = true;
- return NULL;
+ Assert(reader->typmodmap != NULL);
+ mapent = hash_search(reader->typmodmap, &typmod,
+ HASH_FIND, NULL);
+ if (mapent == NULL)
+ elog(ERROR, "found unrecognized remote typmod %d", typmod);
+ typmod = mapent->localtypmod;
}
- if (done != NULL)
- *done = false;
+ /*
+ * Fetch tupledesc and compute remap info. We should probably cache this
+ * so that we don't have to keep recomputing it.
+ */
+ tupledesc = lookup_rowtype_tupdesc(typeid, typmod);
+ remapinfo = BuildRemapInfo(tupledesc);
+ DecrTupleDescRefCount(tupledesc);
+
+ /* Remap tuple. */
+ ItemPointerSetInvalid(&htup.t_self);
+ htup.t_tableOid = InvalidOid;
+ htup.t_len = HeapTupleHeaderGetDatumLength(tup);
+ htup.t_data = tup;
+ atup = TupleQueueRemapTuple(reader, tupledesc, remapinfo, &htup);
+ HeapTupleHeaderSetTypeId(atup->t_data, typeid);
+ HeapTupleHeaderSetTypMod(atup->t_data, typmod);
+ HeapTupleHeaderSetDatumLength(atup->t_data, htup.t_len);
+
+ /* And return the results. */
+ return HeapTupleHeaderGetDatum(atup->t_data);
+}
- for (;;)
+/*
+ * Handle a control message from the tuple queue reader.
+ *
+ * Control messages are sent when the remote side is sending tuples that
+ * contain transient record types. We need to arrange to bless those
+ * record types locally and translate between remote and local typmods.
+ */
+static void
+TupleQueueHandleControlMessage(TupleQueueReader *reader, Size nbytes,
+ char *data)
+{
+ int natts;
+ int remotetypmod;
+ bool hasoid;
+ char *buf = data;
+ int rc = 0;
+ int i;
+ Form_pg_attribute *attrs;
+ MemoryContext oldcontext;
+ TupleDesc tupledesc;
+ RecordTypemodMap *mapent;
+ bool found;
+
+ /* Extract remote typmod. */
+ memcpy(&remotetypmod, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract attribute count. */
+ memcpy(&natts, &buf[rc], sizeof(int));
+ rc += sizeof(int);
+
+ /* Extract hasoid flag. */
+ memcpy(&hasoid, &buf[rc], sizeof(bool));
+ rc += sizeof(bool);
+
+ /* Extract attribute details. */
+ oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+ attrs = palloc(natts * sizeof(Form_pg_attribute));
+ for (i = 0; i < natts; ++i)
{
- shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
- shm_mq_result result;
- Size nbytes;
- void *data;
+ attrs[i] = palloc(sizeof(FormData_pg_attribute));
+ memcpy(attrs[i], &buf[rc], sizeof(FormData_pg_attribute));
+ rc += sizeof(FormData_pg_attribute);
+ }
+ MemoryContextSwitchTo(oldcontext);
- /* Attempt to read a message. */
- result = shm_mq_receive(mqh, &nbytes, &data, true);
+ /* We should have read the whole message. */
+ Assert(rc == nbytes);
- /*
- * Normally, we advance funnel->nextqueue to the next queue at this
- * point, but if we're pointing to a queue that we've just discovered
- * is detached, then forget that queue and leave the pointer where it
- * is until the number of remaining queues fall below that pointer and
- * at that point make the pointer point to the first queue.
- */
- if (result != SHM_MQ_DETACHED)
- funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
- else
+ /* Construct TupleDesc. */
+ tupledesc = CreateTupleDesc(natts, hasoid, attrs);
+ tupledesc = BlessTupleDesc(tupledesc);
+
+ /* Create map if it doesn't exist already. */
+ if (reader->typmodmap == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(int);
+ ctl.entrysize = sizeof(RecordTypemodMap);
+ ctl.hcxt = CurTransactionContext;
+ reader->typmodmap = hash_create("typmodmap hashtable",
+ 100, &ctl, HASH_ELEM | HASH_CONTEXT);
+ }
+
+ /* Create map entry. */
+ mapent = hash_search(reader->typmodmap, &remotetypmod, HASH_ENTER,
+ &found);
+ if (found)
+ elog(ERROR, "duplicate message for typmod %d",
+ remotetypmod);
+ mapent->localtypmod = tupledesc->tdtypmod;
+ elog(DEBUG3, "mapping remote typmod %d to local typmod %d",
+ remotetypmod, tupledesc->tdtypmod);
+}
+
+/*
+ * Build a mapping indicating what remapping class applies to each attribute
+ * described by a tupledesc.
+ */
+static RemapInfo *
+BuildRemapInfo(TupleDesc tupledesc)
+{
+ RemapInfo *remapinfo;
+ Size size;
+ AttrNumber i;
+ bool noop = true;
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ size = offsetof(RemapInfo, mapping) +
+ sizeof(RemapClass) * tupledesc->natts;
+ remapinfo = MemoryContextAllocZero(TopMemoryContext, size);
+ remapinfo->natts = tupledesc->natts;
+ for (i = 0; i < tupledesc->natts; ++i)
+ {
+ Form_pg_attribute attr = tupledesc->attrs[i];
+
+ if (attr->attisdropped)
{
- --funnel->nqueues;
- if (funnel->nqueues == 0)
- {
- if (done != NULL)
- *done = true;
- return NULL;
- }
+ remapinfo->mapping[i] = TQUEUE_REMAP_NONE;
+ continue;
+ }
- memmove(&funnel->queue[funnel->nextqueue],
- &funnel->queue[funnel->nextqueue + 1],
- sizeof(shm_mq_handle *)
- * (funnel->nqueues - funnel->nextqueue));
+ remapinfo->mapping[i] = GetRemapClass(attr->atttypid);
+ if (remapinfo->mapping[i] != TQUEUE_REMAP_NONE)
+ noop = false;
+ }
+
+ if (noop)
+ {
+ appendStringInfo(&buf, "noop");
+ pfree(remapinfo);
+ remapinfo = NULL;
+ }
+
+ return remapinfo;
+}
+
+/*
+ * Determine the remap class associated with a particular data type.
+ *
+ * Transient record types need to have the typmod applied on the sending side
+ * replaced with a value on the receiving side that has the same meaning.
+ *
+ * Arrays, range types, and all record types (including named composite types)
+ * need to be searched for transient record values buried within them.
+ * Surprisingly, a walker is required even when the indicated type is a
+ * composite type, because the actual value may be a compatible transient
+ * record type.
+ */
+static RemapClass
+GetRemapClass(Oid typeid)
+{
+ RemapClass forceResult = TQUEUE_REMAP_NONE;
+ RemapClass innerResult = TQUEUE_REMAP_NONE;
+
+ for (;;)
+ {
+ HeapTuple tup;
+ Form_pg_type typ;
- if (funnel->nextqueue >= funnel->nqueues)
- funnel->nextqueue = 0;
+ /* Simple cases. */
+ if (typeid == RECORDOID)
+ {
+ innerResult = TQUEUE_REMAP_RECORD;
+ break;
+ }
+ if (typeid == RECORDARRAYOID)
+ {
+ innerResult = TQUEUE_REMAP_ARRAY;
+ break;
+ }
- if (funnel->nextqueue < waitpos)
- --waitpos;
+ /* Otherwise, we need a syscache lookup to figure it out. */
+ tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typeid));
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "cache lookup failed for type %u", typeid);
+ typ = (Form_pg_type) GETSTRUCT(tup);
+ /* Look through domains to underlying base type. */
+ if (typ->typtype == TYPTYPE_DOMAIN)
+ {
+ typeid = typ->typbasetype;
+ ReleaseSysCache(tup);
continue;
}
- /* If we got a message, return it. */
- if (result == SHM_MQ_SUCCESS)
+ /*
+ * Look through arrays to underlying base type, but the final return
+ * value must be either TQUEUE_REMAP_ARRAY or TQUEUE_REMAP_NONE. (If
+ * this is an array of integers, for example, we don't need to walk
+ * it.)
+ */
+ if (OidIsValid(typ->typelem) && typ->typlen == -1)
{
- HeapTupleData htup;
-
- /*
- * The tuple data we just read from the queue is only valid until
- * we again attempt to read from it. Copy the tuple into a single
- * palloc'd chunk as callers will expect.
- */
- ItemPointerSetInvalid(&htup.t_self);
- htup.t_tableOid = InvalidOid;
- htup.t_len = nbytes;
- htup.t_data = data;
- return heap_copytuple(&htup);
+ typeid = typ->typelem;
+ ReleaseSysCache(tup);
+ if (forceResult == TQUEUE_REMAP_NONE)
+ forceResult = TQUEUE_REMAP_ARRAY;
+ continue;
}
/*
- * If we've visited all of the queues, then we should either give up
- * and return NULL (if we're in non-blocking mode) or wait for the
- * process latch to be set (otherwise).
+ * Similarly, look through ranges to the underlying base type, but the
+ * final return value must be either TQUEUE_REMAP_RANGE or
+ * TQUEUE_REMAP_NONE.
*/
- if (funnel->nextqueue == waitpos)
+ if (typ->typtype == TYPTYPE_RANGE)
{
- if (nowait)
- return NULL;
- WaitLatch(MyLatch, WL_LATCH_SET, 0);
- CHECK_FOR_INTERRUPTS();
- ResetLatch(MyLatch);
+ ReleaseSysCache(tup);
+ if (forceResult == TQUEUE_REMAP_NONE)
+ forceResult = TQUEUE_REMAP_RANGE;
+ typeid = get_range_subtype(typeid);
+ continue;
}
+
+ /* Walk composite types. Nothing else needs special handling. */
+ if (typ->typtype == TYPTYPE_COMPOSITE)
+ innerResult = TQUEUE_REMAP_RECORD;
+ ReleaseSysCache(tup);
+ break;
}
+
+ if (innerResult != TQUEUE_REMAP_NONE && forceResult != TQUEUE_REMAP_NONE)
+ return forceResult;
+ return innerResult;
}
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
index 6f8eb73..6a668fa 100644
--- a/src/include/executor/tqueue.h
+++ b/src/include/executor/tqueue.h
@@ -21,11 +21,11 @@
extern DestReceiver *CreateTupleQueueDestReceiver(shm_mq_handle *handle);
/* Use these to receive tuples from a shm_mq. */
-typedef struct TupleQueueFunnel TupleQueueFunnel;
-extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
-extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
-extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
-extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
- bool *done);
+typedef struct TupleQueueReader TupleQueueReader;
+extern TupleQueueReader *CreateTupleQueueReader(shm_mq_handle *handle,
+ TupleDesc tupledesc);
+extern void DestroyTupleQueueReader(TupleQueueReader *funnel);
+extern HeapTuple TupleQueueReaderNext(TupleQueueReader *,
+ bool nowait, bool *done);
#endif /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 939bc0e..58ec889 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1963,7 +1963,9 @@ typedef struct GatherState
PlanState ps; /* its first field is NodeTag */
bool initialized;
struct ParallelExecutorInfo *pei;
- struct TupleQueueFunnel *funnel;
+ int nreaders;
+ int nextreader;
+ struct TupleQueueReader **reader;
TupleTableSlot *funnel_slot;
bool need_to_scan_locally;
} GatherState;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..03e1d2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2018,7 +2018,7 @@ TupleHashEntry
TupleHashEntryData
TupleHashIterator
TupleHashTable
-TupleQueueFunnel
+TupleQueueReader
TupleTableSlot
Tuplesortstate
Tuplestorestate
--
2.3.8 (Apple Git-58)
On Mon, Nov 2, 2015 at 9:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Oct 28, 2015 at 10:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Oct 18, 2015 at 12:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
So reviewing patch 13 isn't possible without prior knowledge.
The basic question for patch 13 is whether ephemeral record types can
occur in executor tuples in any contexts that I haven't identified. I
know that a tuple table slot can have a column that is of type
record or record[], and those records can themselves contain
attributes of type record or record[], and so on as far down as you
like. I *think* that's the only case. For example, I don't believe
that a TupleTableSlot can contain a *named* record type that has an
anonymous record buried down inside of it somehow. But I'm not
positive I'm right about that.
I have done some more testing and investigation and determined that
this optimism was unwarranted. It turns out that the type information
for composite and record types gets stored in two different places.
First, the TupleTableSlot has a type OID, indicating the sort of the
value it expects to be stored for that slot attribute. Second, the
value itself contains a type OID and typmod. And these don't have to
match. For example, consider this query:
select row_to_json(i) from int8_tbl i(x,y);
Without i(x,y), the HeapTuple passed to row_to_json is labelled with
the pg_type OID of int8_tbl. But with the query as written, it's
labeled as an anonymous record type. If I jigger things by hacking
the code so that this is planned as Gather (single-copy) -> SeqScan,
with row_to_json evaluated at the Gather node, then the sequential
scan kicks out a tuple with a transient record type and stores it into
a slot whose type OID is still that of int8_tbl. My previous patch
failed to deal with that; the attached one does.
The previous patch was also defective in a few other respects. The
most significant of those, maybe, is that it somehow thought it was OK
to assume that transient typmods from all workers could be treated
interchangeably rather than individually. To fix this, I've changed
the TupleQueueFunnel implemented by tqueue.c to be merely a
TupleQueueReader which handles reading from a single worker only.
nodeGather.c therefore creates one TupleQueueReader per worker instead
of a single TupleQueueFunnel for all workers; accordingly, the logic
for multiplexing multiple queues now lives in nodeGather.c. This is
probably how I should have done it originally - someone, I think Jeff
Davis - complained previously that tqueue.c had no business embedding
the round-robin policy decision, and he was right. So this addresses
that complaint as well.
Here is an updated version. This is rebased over recent commits, and
I added a missing check for attisdropped.
Committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 19, 2015 at 12:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Oct 17, 2015 at 9:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
If all that is required is a #define, like CLOBBER_CACHE_ALWAYS, then no
special buildfarm support is required - you would just add that to the
animal's config file, more or less like this:
config_env =>
{
CPPFLAGS => '-DGRATUITOUSLY_PARALLEL',
},
I try to make things easy :-)
Wow, that's great. So, I'll try to rework the test code I posted
previously into something less hacky, and eventually add a #define
like this so we can run it on the buildfarm. There's a few other
things that need to get done before that really makes sense - like
getting the rest of the bug fix patches committed - otherwise any
buildfarm critters we add will just be permanently red.
OK, so after a bit more delay than I would have liked, I now have a
working set of patches that we can use to ensure automated testing of
the parallel mode infrastructure. I ended up doing something that
does not require a #define, so I'll need some guidance on what to do
on the BF side given that context. Please find attached three
patches, two of them for commit.
group-locking-v1.patch is a vastly improved version of the group
locking patch that we discussed, uh, extensively last year. I realize
that there was a lot of doubt about this approach, but I still believe
it's the right approach, I have put a lot of work into making it work
correctly, I don't think anyone has come up with a really plausible
alternative approach (except one other approach I tried which turned
out to work but with significantly more restrictions), and I'm
committed to fixing it in whatever way is necessary if it turns out to
be broken, even if that amounts to a full rewrite. Review is welcome,
but I honestly believe it's a good idea to get this into the tree
sooner rather than later at this point, because automated regression
testing falls to pieces without these changes, and I believe that
automated regression testing is a really good idea to shake out
whatever bugs we may have in the parallel query stuff. The code in
this patch is all mine, but Amit Kapila deserves credit as co-author
for doing a lot of prototyping (that ended up getting tossed) and
testing. This patch includes comments and an addition to
src/backend/storage/lmgr/README which explain in more detail what this
patch does, how it does it, and why that's OK.
force-parallel-mode-v1.patch is what adds the actual infrastructure
for automated testing. You can set force_parallel_mode=on to force
queries to be run in a worker whenever possible; this can help test
whether your user-defined functions have been erroneously labeled as
PARALLEL SAFE. If they error out or misbehave with this setting
enabled, you should label them PARALLEL RESTRICTED or PARALLEL UNSAFE.
If you set force_parallel_mode=regress, then some additional changes
intended specifically for regression testing kick in; those changes
are intended to ensure that you get exactly the same output from
running the regression tests with the parallelism infrastructure
forcibly enabled that you would have gotten anyway. Most of this code
is mine, but there are also contributions from Amit Kapila and Rushabh
Lathia.
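For example, if a function turns out to misbehave under this setting, it can
be re-labeled with something along these lines (my_audit_log is a made-up
name, purely for illustration):

    ALTER FUNCTION my_audit_log(text) PARALLEL RESTRICTED;
    -- or, if it must not run under the parallel-mode restrictions at all:
    ALTER FUNCTION my_audit_log(text) PARALLEL UNSAFE;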
With both of these patches, you can create a file that says:
force_parallel_mode=regress
max_parallel_degree=2
Then you can run: make check-world TEMP_CONFIG=/path/to/aforementioned/file
If you do, you'll find that while the core regression tests pass
(whee!) the pg_upgrade regression tests fail (oops) because of a
pre-existing bug in the parallelism code introduced by neither of
these two patches. I'm not exactly sure how to fix that bug yet - I
have a couple of ideas - but I think the fact that this test code
promptly found a bug is a good sign that it provides enough test
coverage to be useful. Sticking a Gather node on top of every query
where it looks safe just turns out to exercise a lot of things: the
code that decides whether it's safe to put that Gather node there, the code
to launch and manage parallel workers, the code those workers
themselves run, etc. The point is just to force as much of the
parallel code to be used as possible even when it's not expected to
make anything faster.
test-group-locking-v1.patch is useful for testing possible deadlock
scenarios with the group locking patch. It's not otherwise safe to
use this, like, at all, and the patch is not proposed for commit.
This patch is entirely by Amit Kapila.
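As a rough sketch of how it might be exercised (the table name and PID below
are invented for illustration), two sessions can be tied into one lock group
by hand and then made to take locks that would normally conflict:

    CREATE EXTENSION test_group_deadlocks;

    -- session 1 (acts as the lock group leader)
    SELECT become_lock_group_leader();
    SELECT pg_backend_pid();                  -- suppose this returns 12345
    BEGIN; LOCK TABLE some_table IN ACCESS EXCLUSIVE MODE;

    -- session 2 (joins session 1's lock group by the PID noted above)
    SELECT become_lock_group_member(12345);
    BEGIN; LOCK TABLE some_table IN ACCESS EXCLUSIVE MODE;  -- not blocked

From there, additional sessions and lock orderings can be arranged to probe
how the deadlock detector handles waits involving lock group members.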
In addition to what's in these patches, I'd like to add a new chapter
to the documentation explaining which queries can be parallelized and
in what ways, what the restrictions are that keep parallel query from
getting used, and some high-level details of how parallelism "works"
in PostgreSQL from a user perspective. Things will obviously change
here as we get more capabilities, but I think we're at a point where
it makes sense to start putting this together. What I'm less clear
about is where exactly in the current SGML documentation such a new
chapter might fit; suggestions very welcome.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
group-locking-v1.patch (application/x-patch)
From fec950b2d1e1686defb950ce95763b107bd2f656 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sat, 3 Oct 2015 13:34:35 -0400
Subject: [PATCH 1/3] Introduce group locking to prevent parallel processes
from deadlocking.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes. There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach. Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.
Robert Haas, with help from Amit Kapila.
---
src/backend/access/transam/parallel.c | 16 ++
src/backend/storage/lmgr/README | 63 ++++++++
src/backend/storage/lmgr/deadlock.c | 279 +++++++++++++++++++++++++++-------
src/backend/storage/lmgr/lock.c | 122 ++++++++++++---
src/backend/storage/lmgr/proc.c | 158 ++++++++++++++++++-
src/include/storage/lock.h | 13 +-
src/include/storage/proc.h | 12 ++
7 files changed, 587 insertions(+), 76 deletions(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8eea092..bf2e691 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -432,6 +432,9 @@ LaunchParallelWorkers(ParallelContext *pcxt)
if (pcxt->nworkers == 0)
return;
+ /* We need to be a lock group leader. */
+ BecomeLockGroupLeader();
+
/* If we do have workers, we'd better have a DSM segment. */
Assert(pcxt->seg != NULL);
@@ -952,6 +955,19 @@ ParallelWorkerMain(Datum main_arg)
*/
/*
+ * Join locking group. We must do this before anything that could try
+ * to acquire a heavyweight lock, because any heavyweight locks acquired
+ * to this point could block either directly against the parallel group
+ * leader or against some process which in turn waits for a lock that
+ * conflicts with the parallel group leader, causing an undetected
+ * deadlock. (If we can't join the lock group, the leader has gone away,
+ * so just exit quietly.)
+ */
+ if (!BecomeLockGroupMember(fps->parallel_master_pgproc,
+ fps->parallel_master_pid))
+ return;
+
+ /*
* Load libraries that were loaded by original backend. We want to do
* this before restoring GUCs, because the libraries might define custom
* variables.
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 8898e25..cb9c7d6 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -586,6 +586,69 @@ The caller can then send a cancellation signal. This implements the
principle that autovacuum has a low locking priority (eg it must not block
DDL on the table).
+Group Locking
+-------------
+
+As if all of that weren't already complicated enough, PostgreSQL now supports
+parallelism (see src/backend/access/transam/README.parallel), which means that
+we might need to resolve deadlocks that occur between gangs of related processes
+rather than individual processes. This doesn't change the basic deadlock
+detection algorithm very much, but it makes the bookkeeping more complicated.
+
+We choose to regard locks held by processes in the same parallel group as
+non-conflicting. This means that two processes in a parallel group can hold
+a self-exclusive lock on the same relation at the same time, or one process
+can acquire an AccessShareLock while the other already holds AccessExclusiveLock.
+This might seem dangerous and could be in some cases (more on that below), but
+if we didn't do this then parallel query would be extremely prone to
+self-deadlock. For example, a parallel query against a relation on which the
+leader had already AccessExclusiveLock would hang, because the workers would
+try to lock the same relation and be blocked by the leader; yet the leader can't
+finish until it receives completion indications from all workers. An undetected
+deadlock results. This is far from the only scenario where such a problem
+happens. The same thing will occur if the leader holds only AccessShareLock,
+the worker seeks AccessShareLock, but between the time the leader attempts to
+acquire the lock and the time the worker attempts to acquire it, some other
+process queues up waiting for an AccessExclusiveLock. In this case, too, an
+indefinite hang results.
+
+It might seem that we could predict which locks the workers will attempt to
+acquire and ensure before going parallel that those locks would be acquired
+successfully. But this is very difficult to make work in a general way. For
+example, a parallel worker's portion of the query plan could involve an
+SQL-callable function which generates a query dynamically, and that query
+might happen to hit a table on which the leader happens to hold
+AccessExclusiveLock. By imposing enough restrictions on what workers can do,
+we could eventually create a situation where their behavior can be adequately
+restricted, but these restrictions would be fairly onerous, and even then, the
+system required to decide whether the workers will succeed at acquiring the
+necessary locks would be complex and possibly buggy.
+
+So, instead, we take the approach of deciding that locks within a lock group
+do not conflict. This eliminates the possibility of an undetected deadlock,
+but also opens up some problem cases: if the leader and worker try to do some
+operation at the same time which would ordinarily be prevented by the heavyweight
+lock mechanism, undefined behavior might result. In practice, the dangers are
+modest. The leader and worker share the same transaction, snapshot, and combo
+CID hash, and neither can perform any DDL or, indeed, write any data at all.
+Thus, for either to read a table locked exclusively by the other is safe enough.
+Problems would occur if the leader initiated parallelism from a point in the
+code at which it had some backend-private state that made table access from
+another process unsafe: for example, after calling SetReindexProcessing and
+before calling ResetReindexProcessing, catastrophe could ensue, because the
+worker won't have that state. Similarly, problems could occur with certain
+kinds of non-relation locks, such as relation extension locks. It's no safer
+for two related processes to extend the same relation at the same time than for
+unrelated processes to do the same. However, since parallel mode is strictly
+read-only at present, neither this nor most of the similar cases can arise at
+present. To allow parallel writes, we'll either need to (1) further enhance
+the deadlock detector to handle those types of locks in a different way than
+other types; or (2) have parallel workers use some other mutual exclusion
+method for such cases; or (3) revise those cases so that they no longer use
+heavyweight locking in the first place (which is not a crazy idea, given that
+such lock acquisitions are not expected to deadlock and that heavyweight lock
+acquisition is fairly slow anyway).
+
User Locks (Advisory Locks)
---------------------------
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index a68aaf6..69f678b 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -38,6 +38,7 @@ typedef struct
{
PGPROC *waiter; /* the waiting process */
PGPROC *blocker; /* the process it is waiting for */
+ LOCK *lock; /* the lock it is waiting for */
int pred; /* workspace for TopoSort */
int link; /* workspace for TopoSort */
} EDGE;
@@ -72,6 +73,9 @@ static bool FindLockCycle(PGPROC *checkProc,
EDGE *softEdges, int *nSoftEdges);
static bool FindLockCycleRecurse(PGPROC *checkProc, int depth,
EDGE *softEdges, int *nSoftEdges);
+static bool FindLockCycleRecurseMember(PGPROC *checkProc,
+ PGPROC *checkProcLeader,
+ int depth, EDGE *softEdges, int *nSoftEdges);
static bool ExpandConstraints(EDGE *constraints, int nConstraints);
static bool TopoSort(LOCK *lock, EDGE *constraints, int nConstraints,
PGPROC **ordering);
@@ -449,18 +453,15 @@ FindLockCycleRecurse(PGPROC *checkProc,
EDGE *softEdges, /* output argument */
int *nSoftEdges) /* output argument */
{
- PGPROC *proc;
- PGXACT *pgxact;
- LOCK *lock;
- PROCLOCK *proclock;
- SHM_QUEUE *procLocks;
- LockMethod lockMethodTable;
- PROC_QUEUE *waitQueue;
- int queue_size;
- int conflictMask;
int i;
- int numLockModes,
- lm;
+ dlist_iter iter;
+
+ /*
+ * If this process is a lock group member, check the leader instead. (Note
+ * that we might be the leader, in which case this is a no-op.)
+ */
+ if (checkProc->lockGroupLeader != NULL)
+ checkProc = checkProc->lockGroupLeader;
/*
* Have we already seen this proc?
@@ -494,13 +495,57 @@ FindLockCycleRecurse(PGPROC *checkProc,
visitedProcs[nVisitedProcs++] = checkProc;
/*
- * If the proc is not waiting, we have no outgoing waits-for edges.
+ * If the process is waiting, there is an outgoing waits-for edge to each
+ * process that blocks it.
+ */
+ if (checkProc->links.next != NULL && checkProc->waitLock != NULL &&
+ FindLockCycleRecurseMember(checkProc, checkProc, depth, softEdges,
+ nSoftEdges))
+ return true;
+
+ /*
+ * If the process is not waiting, there could still be outgoing waits-for
+ * edges if it is part of a lock group, because other members of the lock
+ * group might be waiting even though this process is not. (Given lock
+ * groups {A1, A2} and {B1, B2}, if A1 waits for B1 and B2 waits for A2,
+ * that is a deadlock even though neither B1 nor A2 is waiting for anything.)
*/
- if (checkProc->links.next == NULL)
- return false;
- lock = checkProc->waitLock;
- if (lock == NULL)
- return false;
+ dlist_foreach(iter, &checkProc->lockGroupMembers)
+ {
+ PGPROC *memberProc;
+
+ memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur);
+
+ if (memberProc->links.next != NULL && memberProc->waitLock != NULL &&
+ memberProc != checkProc &&
+ FindLockCycleRecurseMember(memberProc, checkProc, depth, softEdges,
+ nSoftEdges))
+ return true;
+ }
+
+ return false;
+}
+
+static bool
+FindLockCycleRecurseMember(PGPROC *checkProc,
+ PGPROC *checkProcLeader,
+ int depth,
+ EDGE *softEdges, /* output argument */
+ int *nSoftEdges) /* output argument */
+{
+ PGPROC *proc;
+ LOCK *lock = checkProc->waitLock;
+ PGXACT *pgxact;
+ PROCLOCK *proclock;
+ SHM_QUEUE *procLocks;
+ LockMethod lockMethodTable;
+ PROC_QUEUE *waitQueue;
+ int queue_size;
+ int conflictMask;
+ int i;
+ int numLockModes,
+ lm;
+
lockMethodTable = GetLocksMethodTable(lock);
numLockModes = lockMethodTable->numLockModes;
conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
@@ -516,11 +561,14 @@ FindLockCycleRecurse(PGPROC *checkProc,
while (proclock)
{
+ PGPROC *leader;
+
proc = proclock->tag.myProc;
pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+ leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
- /* A proc never blocks itself */
- if (proc != checkProc)
+ /* A proc never blocks itself or any other lock group member */
+ if (leader != checkProcLeader)
{
for (lm = 1; lm <= numLockModes; lm++)
{
@@ -601,10 +649,20 @@ FindLockCycleRecurse(PGPROC *checkProc,
for (i = 0; i < queue_size; i++)
{
+ PGPROC *leader;
+
proc = procs[i];
+ leader = proc->lockGroupLeader == NULL ? proc :
+ proc->lockGroupLeader;
- /* Done when we reach the target proc */
- if (proc == checkProc)
+ /*
+ * TopoSort will always return an ordering with group members
+ * adjacent to each other in the wait queue (see comments
+ * therein). So, as soon as we reach a process in the same lock
+ * group as checkProc, we know we've found all the conflicts that
+ * precede any member of the lock group led by checkProcLeader.
+ */
+ if (leader == checkProcLeader)
break;
/* Is there a conflict with this guy's request? */
@@ -625,8 +683,9 @@ FindLockCycleRecurse(PGPROC *checkProc,
* Add this edge to the list of soft edges in the cycle
*/
Assert(*nSoftEdges < MaxBackends);
- softEdges[*nSoftEdges].waiter = checkProc;
- softEdges[*nSoftEdges].blocker = proc;
+ softEdges[*nSoftEdges].waiter = checkProcLeader;
+ softEdges[*nSoftEdges].blocker = leader;
+ softEdges[*nSoftEdges].lock = lock;
(*nSoftEdges)++;
return true;
}
@@ -635,20 +694,52 @@ FindLockCycleRecurse(PGPROC *checkProc,
}
else
{
+ PGPROC *lastGroupMember = NULL;
+
/* Use the true lock wait queue order */
waitQueue = &(lock->waitProcs);
- queue_size = waitQueue->size;
- proc = (PGPROC *) waitQueue->links.next;
+ /*
+ * Find the last member of the lock group that is present in the wait
+ * queue. Anything after this is not a soft lock conflict. If group
+ * locking is not in use, then we know immediately which process we're
+ * looking for, but otherwise we've got to search the wait queue to
+ * find the last process actually present.
+ */
+ if (checkProc->lockGroupLeader == NULL)
+ lastGroupMember = checkProc;
+ else
+ {
+ proc = (PGPROC *) waitQueue->links.next;
+ queue_size = waitQueue->size;
+ while (queue_size-- > 0)
+ {
+ if (proc->lockGroupLeader == checkProcLeader)
+ lastGroupMember = proc;
+ proc = (PGPROC *) proc->links.next;
+ }
+ Assert(lastGroupMember != NULL);
+ }
+ /*
+ * OK, now rescan (or scan) the queue to identify the soft conflicts.
+ */
+ queue_size = waitQueue->size;
+ proc = (PGPROC *) waitQueue->links.next;
while (queue_size-- > 0)
{
+ PGPROC *leader;
+
+ leader = proc->lockGroupLeader == NULL ? proc :
+ proc->lockGroupLeader;
+
/* Done when we reach the target proc */
- if (proc == checkProc)
+ if (proc == lastGroupMember)
break;
/* Is there a conflict with this guy's request? */
- if ((LOCKBIT_ON(proc->waitLockMode) & conflictMask) != 0)
+ if ((LOCKBIT_ON(proc->waitLockMode) & conflictMask) != 0 &&
+ leader != checkProcLeader)
{
/* This proc soft-blocks checkProc */
if (FindLockCycleRecurse(proc, depth + 1,
@@ -665,8 +756,9 @@ FindLockCycleRecurse(PGPROC *checkProc,
* Add this edge to the list of soft edges in the cycle
*/
Assert(*nSoftEdges < MaxBackends);
- softEdges[*nSoftEdges].waiter = checkProc;
- softEdges[*nSoftEdges].blocker = proc;
+ softEdges[*nSoftEdges].waiter = checkProcLeader;
+ softEdges[*nSoftEdges].blocker = leader;
+ softEdges[*nSoftEdges].lock = lock;
(*nSoftEdges)++;
return true;
}
@@ -711,8 +803,7 @@ ExpandConstraints(EDGE *constraints,
*/
for (i = nConstraints; --i >= 0;)
{
- PGPROC *proc = constraints[i].waiter;
- LOCK *lock = proc->waitLock;
+ LOCK *lock = constraints[i].lock;
/* Did we already make a list for this lock? */
for (j = nWaitOrders; --j >= 0;)
@@ -778,7 +869,9 @@ TopoSort(LOCK *lock,
PGPROC *proc;
int i,
j,
+ jj,
k,
+ kk,
last;
/* First, fill topoProcs[] array with the procs in their current order */
@@ -798,41 +891,95 @@ TopoSort(LOCK *lock,
* stores its list link in constraints[i].link (note any constraint will
* be in just one list). The array index for the before-proc of the i'th
* constraint is remembered in constraints[i].pred.
+ *
+ * Note that it's not necessarily the case that every constraint affects
+ * this particular wait queue. Prior to group locking, a process could be
+ * waiting for at most one lock. But a lock group can be waiting for
+ * zero, one, or multiple locks. Since topoProcs[] is an array of the
+ * processes actually waiting, while constraints[] is an array of group
+ * leaders, we've got to scan through topoProcs[] for each constraint,
+ * checking whether both a waiter and a blocker for that group are
+ * present. If so, the constraint is relevant to this wait queue; if not,
+ * it isn't.
*/
MemSet(beforeConstraints, 0, queue_size * sizeof(int));
MemSet(afterConstraints, 0, queue_size * sizeof(int));
for (i = 0; i < nConstraints; i++)
{
+ /*
+ * Find a representative process that is on the lock queue and part of
+ * the waiting lock group. This may or may not be the leader, which
+ * may or may not be waiting at all. If there are any other processes
+ * in the same lock group on the queue, set their number of
+ * beforeConstraints to -1 to indicate that they should be emitted
+ * with their groupmates rather than considered separately.
+ */
proc = constraints[i].waiter;
- /* Ignore constraint if not for this lock */
- if (proc->waitLock != lock)
- continue;
- /* Find the waiter proc in the array */
+ Assert(proc != NULL);
+ jj = -1;
for (j = queue_size; --j >= 0;)
{
- if (topoProcs[j] == proc)
+ PGPROC *waiter = topoProcs[j];
+
+ if (waiter == proc || waiter->lockGroupLeader == proc)
+ {
+ Assert(waiter->waitLock == lock);
+ if (jj == -1)
+ jj = j;
+ else
+ {
+ Assert(beforeConstraints[j] <= 0);
+ beforeConstraints[j] = -1;
+ }
break;
+ }
}
- Assert(j >= 0); /* should have found a match */
- /* Find the blocker proc in the array */
+
+ /* If no matching waiter, constraint is not relevant to this lock. */
+ if (jj < 0)
+ continue;
+
+ /*
+ * Similarly, find a representative process that is on the lock queue
+ * and waiting for the blocking lock group. Again, this could be the
+ * leader but does not need to be.
+ */
proc = constraints[i].blocker;
+ Assert(proc != NULL);
+ kk = -1;
for (k = queue_size; --k >= 0;)
{
- if (topoProcs[k] == proc)
- break;
+ PGPROC *blocker = topoProcs[k];
+
+ if (blocker == proc || blocker->lockGroupLeader == proc)
+ {
+ Assert(blocker->waitLock == lock);
+ if (kk == -1)
+ kk = k;
+ else
+ {
+ Assert(beforeConstraints[k] <= 0);
+ beforeConstraints[k] = -1;
+ }
+ }
}
- Assert(k >= 0); /* should have found a match */
- beforeConstraints[j]++; /* waiter must come before */
+
+ /* If no matching blocker, constraint is not relevant to this lock. */
+ if (kk < 0)
+ continue;
+
+ beforeConstraints[jj]++; /* waiter must come before */
/* add this constraint to list of after-constraints for blocker */
- constraints[i].pred = j;
- constraints[i].link = afterConstraints[k];
- afterConstraints[k] = i + 1;
+ constraints[i].pred = jj;
+ constraints[i].link = afterConstraints[kk];
+ afterConstraints[kk] = i + 1;
}
+
/*--------------------
* Now scan the topoProcs array backwards. At each step, output the
- * last proc that has no remaining before-constraints, and decrease
- * the beforeConstraints count of each of the procs it was constrained
- * against.
+ * last proc that has no remaining before-constraints plus any other
+ * members of the same lock group; then decrease the beforeConstraints
+ * count of each of the procs it was constrained against.
* i = index of ordering[] entry we want to output this time
* j = search index for topoProcs[]
* k = temp for scanning constraint list for proc j
@@ -840,8 +987,11 @@ TopoSort(LOCK *lock,
*--------------------
*/
last = queue_size - 1;
- for (i = queue_size; --i >= 0;)
+ for (i = queue_size - 1; i >= 0;)
{
+ int c;
+ int nmatches = 0;
+
/* Find next candidate to output */
while (topoProcs[last] == NULL)
last--;
@@ -850,12 +1000,37 @@ TopoSort(LOCK *lock,
if (topoProcs[j] != NULL && beforeConstraints[j] == 0)
break;
}
+
/* If no available candidate, topological sort fails */
if (j < 0)
return false;
- /* Output candidate, and mark it done by zeroing topoProcs[] entry */
- ordering[i] = topoProcs[j];
- topoProcs[j] = NULL;
+
+ /*
+ * Output everything in the lock group. There's no point in outputting
+ * an ordering where members of the same lock group are not
+ * consecutive on the wait queue: if some other waiter is between two
+ * requests that belong to the same group, then either it conflicts
+ * with both of them and is certainly not a solution; or it conflicts
+ * with at most one of them and is thus isomorphic to an ordering
+ * where the group members are consecutive.
+ */
+ proc = topoProcs[j];
+ if (proc->lockGroupLeader != NULL)
+ proc = proc->lockGroupLeader;
+ Assert(proc != NULL);
+ for (c = 0; c <= last; ++c)
+ {
+ if (topoProcs[c] == proc || (topoProcs[c] != NULL &&
+ topoProcs[c]->lockGroupLeader == proc))
+ {
+ ordering[i - nmatches] = topoProcs[c];
+ topoProcs[c] = NULL;
+ ++nmatches;
+ }
+ }
+ Assert(nmatches > 0);
+ i -= nmatches;
+
/* Update beforeConstraints counts of its predecessors */
for (k = afterConstraints[j]; k > 0; k = constraints[k - 1].link)
beforeConstraints[constraints[k - 1].pred]--;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 269fe14..e3e9599 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -35,6 +35,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/twophase_rmgr.h"
+#include "access/xact.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -1136,6 +1137,18 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
{
uint32 partition = LockHashPartition(hashcode);
+ /*
+ * It might seem unsafe to access proclock->groupLeader without a lock,
+ * but it's not really. Either we are initializing a proclock on our
+ * own behalf, in which case our group leader isn't changing because
+ * the group leader for a process can only ever be changed by the
+ * process itself; or else we are transferring a fast-path lock to the
+ * main lock table, in which case that process can't change its lock
+ * group leader without first releasing all of its locks (and in
+ * particular the one we are currently transferring).
+ */
+ proclock->groupLeader = proc->lockGroupLeader != NULL ?
+ proc->lockGroupLeader : proc;
proclock->holdMask = 0;
proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
@@ -1255,9 +1268,10 @@ RemoveLocalLock(LOCALLOCK *locallock)
* NOTES:
* Here's what makes this complicated: one process's locks don't
* conflict with one another, no matter what purpose they are held for
- * (eg, session and transaction locks do not conflict).
- * So, we must subtract off our own locks when determining whether the
- * requested new lock conflicts with those already held.
+ * (eg, session and transaction locks do not conflict). Nor do the locks
+ * of one process in a lock group conflict with those of another process in
+ * the same group. So, we must subtract off these locks when determining
+ * whether the requested new lock conflicts with those already held.
*/
int
LockCheckConflicts(LockMethod lockMethodTable,
@@ -1267,8 +1281,12 @@ LockCheckConflicts(LockMethod lockMethodTable,
{
int numLockModes = lockMethodTable->numLockModes;
LOCKMASK myLocks;
- LOCKMASK otherLocks;
+ int conflictMask = lockMethodTable->conflictTab[lockmode];
+ int conflictsRemaining[MAX_LOCKMODES];
+ int totalConflictsRemaining = 0;
int i;
+ SHM_QUEUE *procLocks;
+ PROCLOCK *otherproclock;
/*
* first check for global conflicts: If no locks conflict with my request,
@@ -1279,40 +1297,91 @@ LockCheckConflicts(LockMethod lockMethodTable,
* type of lock that conflicts with request. Bitwise compare tells if
* there is a conflict.
*/
- if (!(lockMethodTable->conflictTab[lockmode] & lock->grantMask))
+ if (!(conflictMask & lock->grantMask))
{
PROCLOCK_PRINT("LockCheckConflicts: no conflict", proclock);
return STATUS_OK;
}
/*
- * Rats. Something conflicts. But it could still be my own lock. We have
- * to construct a conflict mask that does not reflect our own locks, but
- * only lock types held by other processes.
+ * Rats. Something conflicts. But it could still be my own lock, or
+ * a lock held by another member of my locking group. First, figure out
+ * how many conflicts remain after subtracting out any locks I hold
+ * myself.
*/
myLocks = proclock->holdMask;
- otherLocks = 0;
for (i = 1; i <= numLockModes; i++)
{
- int myHolding = (myLocks & LOCKBIT_ON(i)) ? 1 : 0;
+ if ((conflictMask & LOCKBIT_ON(i)) == 0)
+ {
+ conflictsRemaining[i] = 0;
+ continue;
+ }
+ conflictsRemaining[i] = lock->granted[i];
+ if (myLocks & LOCKBIT_ON(i))
+ --conflictsRemaining[i];
+ totalConflictsRemaining += conflictsRemaining[i];
+ }
- if (lock->granted[i] > myHolding)
- otherLocks |= LOCKBIT_ON(i);
+ /* If no conflicts remain, we get the lock. */
+ if (totalConflictsRemaining == 0)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: resolved (simple)", proclock);
+ return STATUS_OK;
+ }
+
+ /* If no group locking, it's definitely a conflict. */
+ if (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)
+ {
+ Assert(proclock->tag.myProc == MyProc);
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+ proclock);
+ return STATUS_FOUND;
}
/*
- * now check again for conflicts. 'otherLocks' describes the types of
- * locks held by other processes. If one of these conflicts with the kind
- * of lock that I want, there is a conflict and I have to sleep.
+ * Locks held in conflicting modes by members of our own lock group are
+ * not real conflicts; we can subtract those out and see if we still have
+ * a conflict. This is O(N) in the number of processes holding or awaiting
+ * locks on this object. We could improve that by making the shared memory
+ * state more complex (and larger) but it doesn't seem worth it.
*/
- if (!(lockMethodTable->conflictTab[lockmode] & otherLocks))
+ procLocks = &(lock->procLocks);
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+ while (otherproclock != NULL)
{
- /* no conflict. OK to get the lock */
- PROCLOCK_PRINT("LockCheckConflicts: resolved", proclock);
- return STATUS_OK;
+ if (proclock != otherproclock &&
+ proclock->groupLeader == otherproclock->groupLeader &&
+ (otherproclock->holdMask & conflictMask) != 0)
+ {
+ int intersectMask = otherproclock->holdMask & conflictMask;
+
+ for (i = 1; i <= numLockModes; i++)
+ {
+ if ((intersectMask & LOCKBIT_ON(i)) != 0)
+ {
+ if (conflictsRemaining[i] <= 0)
+ elog(PANIC, "proclocks held do not match lock");
+ conflictsRemaining[i]--;
+ totalConflictsRemaining--;
+ }
+ }
+
+ if (totalConflictsRemaining == 0)
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: resolved (group)",
+ proclock);
+ return STATUS_OK;
+ }
+ }
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, &otherproclock->lockLink,
+ offsetof(PROCLOCK, lockLink));
}
- PROCLOCK_PRINT("LockCheckConflicts: conflicting", proclock);
+ /* Nope, it's a real conflict. */
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)", proclock);
return STATUS_FOUND;
}
@@ -3095,6 +3164,10 @@ PostPrepare_Locks(TransactionId xid)
PROCLOCKTAG proclocktag;
int partition;
+ /* Can't prepare a lock group follower. */
+ Assert(MyProc->lockGroupLeader == NULL ||
+ MyProc->lockGroupLeader == MyProc);
+
/* This is a critical section: any error means big trouble */
START_CRIT_SECTION();
@@ -3239,6 +3312,13 @@ PostPrepare_Locks(TransactionId xid)
proclocktag.myProc = newproc;
/*
+ * Update groupLeader pointer to point to the new proc. (We'd
+ * better not be a member of somebody else's lock group!)
+ */
+ Assert(proclock->groupLeader == proclock->tag.myProc);
+ proclock->groupLeader = newproc;
+
+ /*
* Update the proclock. We should not find any existing entry for
* the same hash key, since there can be only one entry for any
* given lock with my own proc.
@@ -3785,6 +3865,8 @@ lock_twophase_recover(TransactionId xid, uint16 info,
*/
if (!found)
{
+ Assert(proc->lockGroupLeader == NULL);
+ proclock->groupLeader = proc;
proclock->holdMask = 0;
proclock->releaseMask = 0;
/* Add proclock to appropriate lists */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 3690753..084be5a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -263,6 +263,9 @@ InitProcGlobal(void)
/* Initialize myProcLocks[] shared memory queues. */
for (j = 0; j < NUM_LOCK_PARTITIONS; j++)
SHMQueueInit(&(procs[i].myProcLocks[j]));
+
+ /* Initialize lockGroupMembers list. */
+ dlist_init(&procs[i].lockGroupMembers);
}
/*
@@ -397,6 +400,11 @@ InitProcess(void)
MyProc->backendLatestXid = InvalidTransactionId;
pg_atomic_init_u32(&MyProc->nextClearXidElem, INVALID_PGPROCNO);
+ /* Check that group locking fields are in a proper initial state. */
+ Assert(MyProc->lockGroupLeaderIdentifier == 0);
+ Assert(MyProc->lockGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
@@ -556,6 +564,11 @@ InitAuxiliaryProcess(void)
OwnLatch(&MyProc->procLatch);
SwitchToSharedLatch();
+ /* Check that group locking fields are in a proper initial state. */
+ Assert(MyProc->lockGroupLeaderIdentifier == 0);
+ Assert(MyProc->lockGroupLeader == NULL);
+ Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -794,6 +807,40 @@ ProcKill(int code, Datum arg)
ReplicationSlotRelease();
/*
+ * Detach from any lock group of which we are a member. If the leader
+ * exits before all other group members, its PGPROC will remain allocated
+ * until the last group process exits; that process must return the
+ * leader's PGPROC to the appropriate list.
+ */
+ if (MyProc->lockGroupLeader != NULL)
+ {
+ PGPROC *leader = MyProc->lockGroupLeader;
+ LWLock *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ Assert(!dlist_is_empty(&leader->lockGroupMembers));
+ dlist_delete(&MyProc->lockGroupLink);
+ if (dlist_is_empty(&leader->lockGroupMembers))
+ {
+ leader->lockGroupLeaderIdentifier = 0;
+ leader->lockGroupLeader = NULL;
+ if (leader != MyProc)
+ {
+ procgloballist = leader->procgloballist;
+
+ /* Leader exited first; return its PGPROC. */
+ SpinLockAcquire(ProcStructLock);
+ leader->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = leader;
+ SpinLockRelease(ProcStructLock);
+ }
+ }
+ else if (leader != MyProc)
+ MyProc->lockGroupLeader = NULL;
+ LWLockRelease(leader_lwlock);
+ }
+
+ /*
* Reset MyLatch to the process local one. This is so that signal
* handlers et al can continue using the latch after the shared latch
* isn't ours anymore. After that clear MyProc and disown the shared
@@ -807,9 +854,20 @@ ProcKill(int code, Datum arg)
procgloballist = proc->procgloballist;
SpinLockAcquire(ProcStructLock);
- /* Return PGPROC structure (and semaphore) to appropriate freelist */
- proc->links.next = (SHM_QUEUE *) *procgloballist;
- *procgloballist = proc;
+ /*
+ * If we're still a member of a locking group, that means we're a leader
+ * which has somehow exited before its children. The last remaining child
+ * will release our PGPROC. Otherwise, release it now.
+ */
+ if (proc->lockGroupLeader == NULL)
+ {
+ /* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
+ Assert(dlist_is_empty(&proc->lockGroupMembers));
+
+ /* Return PGPROC structure (and semaphore) to appropriate freelist */
+ proc->links.next = (SHM_QUEUE *) *procgloballist;
+ *procgloballist = proc;
+ }
/* Update shared estimate of spins_per_delay */
ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
@@ -942,9 +1000,31 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
bool allow_autovacuum_cancel = true;
int myWaitStatus;
PGPROC *proc;
+ PGPROC *leader = MyProc->lockGroupLeader;
int i;
/*
+ * If group locking is in use, locks held by members of my locking group
+ * need to be included in myHeldLocks.
+ */
+ if (leader != NULL)
+ {
+ SHM_QUEUE *procLocks = &(lock->procLocks);
+ PROCLOCK *otherproclock;
+
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+ while (otherproclock != NULL)
+ {
+ if (otherproclock->groupLeader == leader)
+ myHeldLocks |= otherproclock->holdMask;
+ otherproclock = (PROCLOCK *)
+ SHMQueueNext(procLocks, &otherproclock->lockLink,
+ offsetof(PROCLOCK, lockLink));
+ }
+ }
+
+ /*
* Determine where to add myself in the wait queue.
*
* Normally I should go at the end of the queue. However, if I already
@@ -968,6 +1048,15 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
proc = (PGPROC *) waitQueue->links.next;
for (i = 0; i < waitQueue->size; i++)
{
+ /*
+ * If we're part of the same locking group as this waiter, its
+ * locks neither conflict with ours nor contribute to aheadRequests.
+ */
+ if (leader != NULL && leader == proc->lockGroupLeader)
+ {
+ proc = (PGPROC *) proc->links.next;
+ continue;
+ }
/* Must he wait for me? */
if (lockMethodTable->conflictTab[proc->waitLockMode] & myHeldLocks)
{
@@ -1658,3 +1747,66 @@ ProcSendSignal(int pid)
SetLatch(&proc->procLatch);
}
}
+
+/*
+ * BecomeLockGroupLeader - designate process as lock group leader
+ *
+ * Once this function has returned, other processes can join the lock group
+ * by calling BecomeLockGroupMember.
+ */
+void
+BecomeLockGroupLeader(void)
+{
+ LWLock *leader_lwlock;
+
+ /* If we already did it, we don't need to do it again. */
+ if (MyProc->lockGroupLeader == MyProc)
+ return;
+
+ /* We had better not be a follower. */
+ Assert(MyProc->lockGroupLeader == NULL);
+
+ /* Create single-member group, containing only ourselves. */
+ leader_lwlock = LockHashPartitionLockByProc(MyProc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ MyProc->lockGroupLeader = MyProc;
+ MyProc->lockGroupLeaderIdentifier = MyProcPid;
+ dlist_push_head(&MyProc->lockGroupMembers, &MyProc->lockGroupLink);
+ LWLockRelease(leader_lwlock);
+}
+
+/*
+ * BecomeLockGroupMember - designate process as lock group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process. To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock. Returns true if we successfully join the intended lock
+ * group, and false if not.
+ */
+bool
+BecomeLockGroupMember(PGPROC *leader, int pid)
+{
+ LWLock *leader_lwlock;
+ bool ok = false;
+
+ /* Group leader can't become member of group */
+ Assert(MyProc != leader);
+
+ /* PID must be valid. */
+ Assert(pid != 0);
+
+ /* Try to join the group. */
+ leader_lwlock = LockHashPartitionLockByProc(MyProc);
+ LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+ if (leader->lockGroupLeaderIdentifier == pid)
+ {
+ ok = true;
+ MyProc->lockGroupLeader = leader;
+ dlist_push_tail(&leader->lockGroupMembers, &MyProc->lockGroupLink);
+ }
+ LWLockRelease(leader_lwlock);
+
+ return ok;
+}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 43eca86..6b4e365 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -346,6 +346,7 @@ typedef struct PROCLOCK
PROCLOCKTAG tag; /* unique identifier of proclock object */
/* data */
+ PGPROC *groupLeader; /* group leader, or NULL if no lock group */
LOCKMASK holdMask; /* bitmask for lock types currently held */
LOCKMASK releaseMask; /* bitmask for lock types to be released */
SHM_QUEUE lockLink; /* list link in LOCK's list of proclocks */
@@ -457,7 +458,6 @@ typedef enum
* worker */
} DeadLockState;
-
/*
* The lockmgr's shared hash tables are partitioned to reduce contention.
* To determine which partition a given locktag belongs to, compute the tag's
@@ -473,6 +473,17 @@ typedef enum
(&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
/*
+ * The deadlock detector needs to be able to access lockGroupLeader and
+ * related fields in the PGPROC, so we arrange for those fields to be protected
+ * by one of the lock hash partition locks. Since the deadlock detector
+ * acquires all such locks anyway, this makes it safe for it to access these
+ * fields without doing anything extra. To avoid contention as much as
+ * possible, we map different PGPROCs to different partition locks.
+ */
+#define LockHashPartitionLockByProc(p) \
+ LockHashPartitionLock((p)->pgprocno)
+
+/*
* function prototypes
*/
extern void InitLocks(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3441288..66ab255 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -155,6 +155,15 @@ struct PGPROC
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
* lock */
+
+ /*
+ * Support for lock groups. Use LockHashPartitionLockByProc to get the
+ * LWLock protecting these fields.
+ */
+ int lockGroupLeaderIdentifier; /* MyProcPid, if I'm a leader */
+ PGPROC *lockGroupLeader; /* lock group leader, if I'm a follower */
+ dlist_head lockGroupMembers; /* list of members, if I'm a leader */
+ dlist_node lockGroupLink; /* my member link, if I'm a member */
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -272,4 +281,7 @@ extern void LockErrorCleanup(void);
extern void ProcWaitForSignal(void);
extern void ProcSendSignal(int pid);
+extern void BecomeLockGroupLeader(void);
+extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+
#endif /* PROC_H */
--
2.5.4 (Apple Git-61)
test-group-locking-v1.patch (application/x-patch)
From b101d27611dd42109f11b09ab3ba65dba91e6341 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 21 Jan 2016 14:33:07 -0500
Subject: [PATCH 2/3] contrib module test_group_deadlocks, not for commit.
Amit Kapila
---
contrib/Makefile | 1 +
contrib/test_group_deadlocks/Makefile | 19 ++++++++
.../test_group_deadlocks--1.0.sql | 15 ++++++
.../test_group_deadlocks/test_group_deadlocks.c | 57 ++++++++++++++++++++++
.../test_group_deadlocks.control | 5 ++
5 files changed, 97 insertions(+)
create mode 100644 contrib/test_group_deadlocks/Makefile
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.c
create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.control
diff --git a/contrib/Makefile b/contrib/Makefile
index bd251f6..ff3c54d 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -43,6 +43,7 @@ SUBDIRS = \
tablefunc \
tcn \
test_decoding \
+ test_group_deadlocks \
tsm_system_rows \
tsm_system_time \
tsearch2 \
diff --git a/contrib/test_group_deadlocks/Makefile b/contrib/test_group_deadlocks/Makefile
new file mode 100644
index 0000000..057448c
--- /dev/null
+++ b/contrib/test_group_deadlocks/Makefile
@@ -0,0 +1,19 @@
+# contrib/test_group_deadlocks/Makefile
+
+MODULE_big = test_group_deadlocks
+OBJS = test_group_deadlocks.o $(WIN32RES)
+
+EXTENSION = test_group_deadlocks
+DATA = test_group_deadlocks--1.0.sql
+PGFILEDESC = "test_group_deadlocks - participate in group locking"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_group_deadlocks
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
new file mode 100644
index 0000000..377c363
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
@@ -0,0 +1,15 @@
+/* contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_group_deadlocks" to load this file. \quit
+
+-- Register the function.
+CREATE FUNCTION become_lock_group_leader()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION become_lock_group_member(pid pg_catalog.int4)
+RETURNS pg_catalog.bool
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.c b/contrib/test_group_deadlocks/test_group_deadlocks.c
new file mode 100644
index 0000000..f3d980a
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_group_deadlocks.c
+ * group locking utilities
+ *
+ * Copyright (c) 2010-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_group_deadlocks/test_group_deadlocks.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(become_lock_group_leader);
+PG_FUNCTION_INFO_V1(become_lock_group_member);
+
+
+/*
+ * become_lock_group_leader
+ *
+ * This function makes the current backend process a lock group
+ * leader.
+ */
+Datum
+become_lock_group_leader(PG_FUNCTION_ARGS)
+{
+ BecomeLockGroupLeader();
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * become_lock_group_member
+ *
+ * This function makes the current backend process a member of the
+ * lock group owned by the process whose PID is passed as the first
+ * argument.
+ */
+Datum
+become_lock_group_member(PG_FUNCTION_ARGS)
+{
+ bool member;
+ PGPROC *procleader;
+ int32 pid = PG_GETARG_INT32(0);
+
+ procleader = BackendPidGetProc(pid);
+ member = BecomeLockGroupMember(procleader, pid);
+
+ PG_RETURN_BOOL(member);
+}
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.control b/contrib/test_group_deadlocks/test_group_deadlocks.control
new file mode 100644
index 0000000..e2dcc71
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.control
@@ -0,0 +1,5 @@
+# test_group_locking extension
+comment = 'become part of group'
+default_version = '1.0'
+module_pathname = '$libdir/test_group_deadlocks'
+relocatable = true
--
2.5.4 (Apple Git-61)
force-parallel-mode-v1.patch (application/x-patch)
From c6b2249ce16f278287dcee0710ca469c271c5cab Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Sep 2015 18:35:40 -0400
Subject: [PATCH 3/3] Introduce a new GUC force_parallel_mode for testing
purposes.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe. For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.
Robert Haas, with help from Amit Kapila and Rushabh Lathia.
---
doc/src/sgml/config.sgml | 45 +++++++++++++++++
src/backend/access/transam/parallel.c | 4 +-
src/backend/commands/explain.c | 14 ++++-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 2 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 5 ++
src/backend/optimizer/plan/planner.c | 73 ++++++++++++++++++++-------
src/backend/utils/misc/guc.c | 24 +++++++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/nodes/plannodes.h | 1 +
src/include/nodes/relation.h | 3 ++
src/include/optimizer/planmain.h | 9 ++++
13 files changed, 163 insertions(+), 20 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..de84b77 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3802,6 +3802,51 @@ SELECT * FROM parent WHERE key = 2400;
</listitem>
</varlistentry>
+ <varlistentry id="guc-force-parallel-mode" xreflabel="force_parallel_mode">
+ <term><varname>force_parallel_mode</varname> (<type>enum</type>)
+ <indexterm>
+ <primary><varname>force_parallel_mode</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Allows the use of parallel queries for testing purposes even in cases
+ where no performance benefit is expected.
+ The allowed values of <varname>force_parallel_mode</> are
+ <literal>off</> (use parallel mode only when it is expected to improve
+ performance), <literal>on</> (force parallel query for all queries
+ for which it is thought to be safe), and <literal>regress</> (like
+ on, but with additional behavior changes to facilitate automated
+ regression testing).
+ </para>
+
+ <para>
+ More specifically, setting this value to <literal>on</> will add
+ a <literal>Gather</> node to the top of any query plan for which this
+ appears to be safe, so that the query runs inside of a parallel worker.
+ Even when a parallel worker is not available or cannot be used,
+ operations such as starting a subtransaction that would be prohibited
+ in a parallel query context will be prohibited unless the planner
+ believes that this will cause the query to fail. If failures or
+ unexpected results occur when this option is set, some functions used
+ by the query may need to be marked <literal>PARALLEL UNSAFE</literal>
+ (or, possibly, <literal>PARALLEL RESTRICTED</literal>).
+ </para>
+
+ <para>
+ Setting this value to <literal>regress</> has all of the same effects
+ as setting it to <literal>on</> plus some additional effects that are
+ intended to facilitate automated regression testing. Normally,
+ messages from a parallel worker are prefixed with a context line,
+ but a setting of <literal>regress</> suppresses this to guarantee
+ reproducible results. Also, the <literal>Gather</> nodes added to
+ plans by this setting are hidden from the <literal>EXPLAIN</> output
+ so that the output matches what would be obtained if this setting
+ were turned <literal>off</>.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index bf2e691..4f91cd0 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -22,6 +22,7 @@
#include "libpq/pqformat.h"
#include "libpq/pqmq.h"
#include "miscadmin.h"
+#include "optimizer/planmain.h"
#include "storage/ipc.h"
#include "storage/sinval.h"
#include "storage/spin.h"
@@ -1079,7 +1080,8 @@ ParallelExtensionTrampoline(dsm_segment *seg, shm_toc *toc)
static void
ParallelErrorContext(void *arg)
{
- errcontext("parallel worker, PID %d", *(int32 *) arg);
+ if (force_parallel_mode != FORCE_PARALLEL_REGRESS)
+ errcontext("parallel worker, PID %d", *(int32 *) arg);
}
/*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 25d8ca0..ee13136 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -23,6 +23,7 @@
#include "foreign/fdwapi.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
+#include "optimizer/planmain.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
#include "tcop/tcopprot.h"
@@ -572,6 +573,7 @@ void
ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
{
Bitmapset *rels_used = NULL;
+ PlanState *ps;
Assert(queryDesc->plannedstmt != NULL);
es->pstmt = queryDesc->plannedstmt;
@@ -580,7 +582,17 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
es->rtable_names = select_rtable_names_for_explain(es->rtable, rels_used);
es->deparse_cxt = deparse_context_for_plan_rtable(es->rtable,
es->rtable_names);
- ExplainNode(queryDesc->planstate, NIL, NULL, NULL, es);
+
+ /*
+ * Sometimes we mark a Gather node as "invisible", which means that it's
+ * not displayed in EXPLAIN output. The purpose of this is to allow
+ * running regression tests with force_parallel_mode=regress to get the
+ * same results as running the same tests with force_parallel_mode=off.
+ */
+ ps = queryDesc->planstate;
+ if (IsA(ps, GatherState) &&((Gather *) ps->plan)->invisible)
+ ps = outerPlanState(ps);
+ ExplainNode(ps, NIL, NULL, NULL, es);
}
/*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8b79fa..e54d174 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -334,6 +334,7 @@ _copyGather(const Gather *from)
*/
COPY_SCALAR_FIELD(num_workers);
COPY_SCALAR_FIELD(single_copy);
+ COPY_SCALAR_FIELD(invisible);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b487c00..97b7fef 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -443,6 +443,7 @@ _outGather(StringInfo str, const Gather *node)
WRITE_INT_FIELD(num_workers);
WRITE_BOOL_FIELD(single_copy);
+ WRITE_BOOL_FIELD(invisible);
}
static void
@@ -1826,6 +1827,7 @@ _outPlannerGlobal(StringInfo str, const PlannerGlobal *node)
WRITE_BOOL_FIELD(hasRowSecurity);
WRITE_BOOL_FIELD(parallelModeOK);
WRITE_BOOL_FIELD(parallelModeNeeded);
+ WRITE_BOOL_FIELD(wholePlanParallelSafe);
WRITE_BOOL_FIELD(hasForeignJoin);
}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c46151..e4d41ee 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2053,6 +2053,7 @@ _readGather(void)
READ_INT_FIELD(num_workers);
READ_BOOL_FIELD(single_copy);
+ READ_BOOL_FIELD(invisible);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 54ff7f6..6e0db08 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -212,6 +212,10 @@ create_plan(PlannerInfo *root, Path *best_path)
/* Recursively process the path tree */
plan = create_plan_recurse(root, best_path);
+ /* Update parallel safety information if needed. */
+ if (!best_path->parallel_safe)
+ root->glob->wholePlanParallelSafe = false;
+
/* Check we successfully assigned all NestLoopParams to plan nodes */
if (root->curOuterParams != NIL)
elog(ERROR, "failed to assign all NestLoopParams to plan nodes");
@@ -4829,6 +4833,7 @@ make_gather(List *qptlist,
plan->righttree = NULL;
node->num_workers = nworkers;
node->single_copy = single_copy;
+ node->invisible = false;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a09b4b5..a3cc274 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -48,10 +48,12 @@
#include "storage/dsm_impl.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
+#include "utils/syscache.h"
-/* GUC parameter */
+/* GUC parameters */
double cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
+int force_parallel_mode = FORCE_PARALLEL_OFF;
/* Hook for plugins to get control in planner() */
planner_hook_type planner_hook = NULL;
@@ -230,25 +232,31 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
!has_parallel_hazard((Node *) parse, true);
/*
- * glob->parallelModeOK should tell us whether it's necessary to impose
- * the parallel mode restrictions, but we don't actually want to impose
- * them unless we choose a parallel plan, so that people who mislabel
- * their functions but don't use parallelism anyway aren't harmed.
- * However, it's useful for testing purposes to be able to force the
- * restrictions to be imposed whenever a parallel plan is actually chosen
- * or not.
+ * glob->parallelModeNeeded should tell us whether it's necessary to
+ * impose the parallel mode restrictions, but we don't actually want to
+ * impose them unless we choose a parallel plan, so that people who
+ * mislabel their functions but don't use parallelism anyway aren't
+ * harmed. But when force_parallel_mode is set, we enable the restrictions
+ * whenever possible for testing purposes.
*
- * (It's been suggested that we should always impose these restrictions
- * whenever glob->parallelModeOK is true, so that it's easier to notice
- * incorrectly-labeled functions sooner. That might be the right thing to
- * do, but for now I've taken this approach. We could also control this
- * with a GUC.)
+ * glob->wholePlanParallelSafe should tell us whether it's OK to stick a
+ * Gather node on top of the entire plan. However, it only needs to be
+ * accurate when force_parallel_mode is 'on' or 'regress', so we don't
+ * bother doing the work otherwise. The value we set here is just a
+ * preliminary guess; it may get changed from true to false later, but
+ * not vice versa.
*/
-#ifdef FORCE_PARALLEL_MODE
- glob->parallelModeNeeded = glob->parallelModeOK;
-#else
- glob->parallelModeNeeded = false;
-#endif
+ if (force_parallel_mode == FORCE_PARALLEL_OFF || !glob->parallelModeOK)
+ {
+ glob->parallelModeNeeded = false;
+ glob->wholePlanParallelSafe = false; /* either false or don't care */
+ }
+ else
+ {
+ glob->parallelModeNeeded = true;
+ glob->wholePlanParallelSafe =
+ !has_parallel_hazard((Node *) parse, false);
+ }
/* Determine what fraction of the plan is likely to be scanned */
if (cursorOptions & CURSOR_OPT_FAST_PLAN)
@@ -293,6 +301,35 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
}
/*
+ * At present, we don't copy subplans to workers. The presence of a
+ * subplan in one part of the plan doesn't preclude the use of parallelism
+ * in some other part of the plan, but it does preclude the possibility of
+ * regarding the entire plan as parallel-safe.
+ */
+ if (glob->subplans != NULL)
+ glob->wholePlanParallelSafe = false;
+
+ /*
+ * Optionally add a Gather node for testing purposes, provided this is
+ * actually a safe thing to do.
+ */
+ if (glob->wholePlanParallelSafe &&
+ force_parallel_mode != FORCE_PARALLEL_OFF)
+ {
+ Gather *gather = makeNode(Gather);
+
+ gather->plan.targetlist = top_plan->targetlist;
+ gather->plan.qual = NIL;
+ gather->plan.lefttree = top_plan;
+ gather->plan.righttree = NULL;
+ gather->num_workers = 1;
+ gather->single_copy = true;
+ gather->invisible = (force_parallel_mode == FORCE_PARALLEL_REGRESS);
+ root->glob->parallelModeNeeded = true;
+ top_plan = &gather->plan;
+ }
+
+ /*
* If any Params were generated, run through the plan tree and compute
* each plan node's extParam/allParam sets. Ideally we'd merge this into
* set_plan_references' tree traversal, but for now it has to be separate
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 38ba82f..14212ee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -379,6 +379,19 @@ static const struct config_enum_entry huge_pages_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry force_parallel_mode_options[] = {
+ {"off", FORCE_PARALLEL_OFF, false},
+ {"on", FORCE_PARALLEL_ON, false},
+ {"regress", FORCE_PARALLEL_REGRESS, false},
+ {"true", FORCE_PARALLEL_ON, true},
+ {"false", FORCE_PARALLEL_OFF, true},
+ {"yes", FORCE_PARALLEL_ON, true},
+ {"no", FORCE_PARALLEL_OFF, true},
+ {"1", FORCE_PARALLEL_ON, true},
+ {"0", FORCE_PARALLEL_OFF, true},
+ {NULL, 0, false}
+};
+
/*
* Options for enum values stored in other modules
*/
@@ -863,6 +876,7 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
@@ -3672,6 +3686,16 @@ static struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"force_parallel_mode", PGC_USERSET, QUERY_TUNING_OTHER,
+ gettext_noop("Forces use of parallel query facilities."),
+ gettext_noop("If possible, run query using a parallel worker and with parallel restrictions.")
+ },
+ &force_parallel_mode,
+ FORCE_PARALLEL_OFF, force_parallel_mode_options,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 029114f..09b2003 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -313,6 +313,7 @@
#from_collapse_limit = 8
#join_collapse_limit = 8 # 1 disables collapsing of explicit
# JOIN clauses
+#force_parallel_mode = off
#------------------------------------------------------------------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 55d6bbe..ae224cf 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -775,6 +775,7 @@ typedef struct Gather
Plan plan;
int num_workers;
bool single_copy;
+ bool invisible; /* suppress EXPLAIN display (for testing)? */
} Gather;
/* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9492598..5c22679 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -108,6 +108,9 @@ typedef struct PlannerGlobal
bool parallelModeOK; /* parallel mode potentially OK? */
bool parallelModeNeeded; /* parallel mode actually required? */
+
+ bool wholePlanParallelSafe; /* is the entire plan parallel safe? */
+
bool hasForeignJoin; /* does have a pushed down foreign join */
} PlannerGlobal;
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 7ae7367..eaa642b 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -17,9 +17,18 @@
#include "nodes/plannodes.h"
#include "nodes/relation.h"
+/* possible values for force_parallel_mode */
+typedef enum
+{
+ FORCE_PARALLEL_OFF,
+ FORCE_PARALLEL_ON,
+ FORCE_PARALLEL_REGRESS
+} ForceParallelMode;
+
/* GUC parameters */
#define DEFAULT_CURSOR_TUPLE_FRACTION 0.1
extern double cursor_tuple_fraction;
+extern int force_parallel_mode;
/* query_planner callback to compute query_pathkeys */
typedef void (*query_pathkeys_callback) (PlannerInfo *root, void *extra);
--
2.5.4 (Apple Git-61)
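For a quick sanity check of the new setting against a server built with this patch, something along these lines should work (a rough sketch only; "some_table" is a placeholder, and any parallel-safe query will do):
# exercise force_parallel_mode by hand; the table name is just an example
psql -d postgres <<'SQL'
SET force_parallel_mode = on;      -- 'regress' would instead hide the added Gather node in EXPLAIN
SET max_parallel_degree = 2;
EXPLAIN (COSTS OFF) SELECT count(*) FROM some_table;  -- some_table is a placeholder
SQL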
On 2016-02-02 15:41:45 -0500, Robert Haas wrote:
group-locking-v1.patch is a vastly improved version of the group
locking patch that we discussed, uh, extensively last year. I realize
that there was a lot of doubt about this approach, but I still believe
it's the right approach, I have put a lot of work into making it work
correctly, I don't think anyone has come up with a really plausible
alternative approach (except one other approach I tried which turned
out to work but with significantly more restrictions), and I'm
committed to fixing it in whatever way is necessary if it turns out to
be broken, even if that amounts to a full rewrite. Review is welcome,
but I honestly believe it's a good idea to get this into the tree
sooner rather than later at this point, because automated regression
testing falls to pieces without these changes, and I believe that
automated regression testing is a really good idea to shake out
whatever bugs we may have in the parallel query stuff. The code in
this patch is all mine, but Amit Kapila deserves credit as co-author
for doing a lot of prototyping (that ended up getting tossed) and
testing. This patch includes comments and an addition to
src/backend/storage/lmgr/README which explain in more detail what this
patch does, how it does it, and why that's OK.
I see you pushed group locking support. I do wonder if somebody has
actually reviewed this? On a quick scrollthrough it seems fairly
invasive, touching some parts where bugs are really hard to find.
I realize that this stuff has all been brewing long, and that there's
still a lot to do. So you gotta keep moving. And I'm not sure that
there's anything wrong or if there's any actually better approach. But
pushing an unreviewed, complex patch, that originated in a thread
originally about different relatively small/mundane items, for a
contentious issue, a few days after the initial post. Hm. Not sure how
you'd react if you weren't the author.
Greetings,
Andres Freund
On Mon, Feb 8, 2016 at 10:17 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-02 15:41:45 -0500, Robert Haas wrote:
group-locking-v1.patch is a vastly improved version of the group
locking patch that we discussed, uh, extensively last year. I realize
that there was a lot of doubt about this approach, but I still believe
it's the right approach, I have put a lot of work into making it work
correctly, I don't think anyone has come up with a really plausible
alternative approach (except one other approach I tried which turned
out to work but with significantly more restrictions), and I'm
committed to fixing it in whatever way is necessary if it turns out to
be broken, even if that amounts to a full rewrite. Review is welcome,
but I honestly believe it's a good idea to get this into the tree
sooner rather than later at this point, because automated regression
testing falls to pieces without these changes, and I believe that
automated regression testing is a really good idea to shake out
whatever bugs we may have in the parallel query stuff. The code in
this patch is all mine, but Amit Kapila deserves credit as co-author
for doing a lot of prototyping (that ended up getting tossed) and
testing. This patch includes comments and an addition to
src/backend/storage/lmgr/README which explain in more detail what this
patch does, how it does it, and why that's OK.
I see you pushed group locking support. I do wonder if somebody has
actually reviewed this? On a quick scrollthrough it seems fairly
invasive, touching some parts where bugs are really hard to find.
I realize that this stuff has all been brewing long, and that there's
still a lot to do. So you gotta keep moving. And I'm not sure that
there's anything wrong or if there's any actually better approach. But
pushing an unreviewed, complex patch, that originated in a thread
originally about different relatively small/mundane items, for a
contentious issue, a few days after the initial post. Hm. Not sure how
you'd react if you weren't the author.
Probably not very well. Do you want me to revert it?
I mean, look. Without that patch, parallel query is definitely
broken. Just revert the patch and try running the regression tests
with force_parallel_mode=regress and max_parallel_degree>0. It hangs
all over the place. With the patch, every regression test suite we
have runs cleanly with those settings. Without the patch, it's
trivial to construct a test case where parallel query experiences an
undetected deadlock. With the patch, it appears to work reliably.
Could there be bugs someplace? Yes, there absolutely could. Do I really
think anybody was going to spend the time to understand deadlock.c
well enough to verify my changes? No, I don't. What I think would
have happened is that the patch would have sat around like an
albatross around my neck - totally ignored by everyone - until the end
of the last CF, and then the discussion would have gone one of three
ways:
1. Boy, this patch is complicated and I don't understand it. Let's
reject it, even though without it parallel query is trivially broken!
Uh, we'll just let parallel query be broken.
2. Like #1, but we rip out parallel query in its entirety on the eve of beta.
3. Oh well, Robert says we need this, I guess we'd better let him commit it.
I don't find any of those options to be better than the status quo.
If the patch is broken, another two months of having it in the tree give
us a better chance of finding the bugs, especially because, combined
with the other patch which I also pushed, it enables *actual automated
regression testing* of the parallelism code, which I personally think
is a really good thing - and I'd like to see the buildfarm doing that
as soon as possible, so that we can find some of those bugs before
we're deep in beta. Not just bugs in group locking but all sorts of
parallelism bugs that might be revealed by end-to-end testing. The
*entire stack of patches* that began this thread was a response to
problems that were found by the automated testing that you can't do
without this patch. Those bug fixes resulted in a huge increase in
the robustness of parallel query, and that would not have happened
without this code. Every single one of those problems, some of them
in commits dating back years, was detected by the same method: run the
regression tests with parallel mode and parallel workers used for
every query for which that seems to be safe.
And, by the way, the patch, aside from the deadlock.c portion, was
posted back in October, admittedly without much fanfare, but nobody
reviewed that or any other patch on this thread. If I'd waited for
those reviews to come in, parallel query would not be committed now,
nor probably in 9.6, nor probably in 9.7 or 9.8 either. The whole
project would just be impossible on its face. It would be impossible
in the first instance if I did not have a commit bit, because there is
just not enough committer bandwidth - even reviewer bandwidth more
generally - to review the number of patches that I've submitted
related to parallelism, so in the end some, perhaps many, of those are
going to be committed mostly on the strength of my personal opinion
that committing them is better than not committing them. I am going
to have a heck of a lot of egg on my face if it turns out that I've
been too aggressive in pushing this stuff into the tree. But,
basically, the alternative is that we don't get the feature, and I
think the feature is important enough to justify taking some risk.
I think it's myopic to say "well, but this patch might have bugs".
Very true. But also, all the other parallelism patches that are
already committed or that are still under review but which can't be
properly tested without this patch might have bugs, too, so you've got
to weigh the risk that this patch might get better if I wait longer to
commit it against the possibility that not having committed it reduces
the chances of finding bugs elsewhere. I don't want it to seem like
I'm forcing this down the community's throat - I don't have a right to
do that, and I will certainly revert this patch if that is the
consensus. But that is not what I think best myself. What I think
would be better is to (1) make an effort to get the buildfarm testing
which this patch enables up and running as soon as possible and (2)
for somebody to read over the committed code and raise any issues that
they find. Or, for that matter, to read over the committed code for
any of the *other* parallelism patches and raise any issues that they
find with *that* code. There's certainly scads of code here and this
is far from the only bit that might have bugs.
Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept. Right now, the deadlock.c part of this patch isn't
tested at all by any of our regression test suites, because NOTHING in
deadlock.c is tested by any of our regression test suites. You can
blow it up with dynamite and the regression tests are perfectly happy,
and that's pretty scary.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/08/2016 10:45 AM, Robert Haas wrote:
On Mon, Feb 8, 2016 at 10:17 AM, Andres Freund <andres@anarazel.de> wrote:
On 2016-02-02 15:41:45 -0500, Robert Haas wrote:
I realize that this stuff has all been brewing long, and that there's
still a lot to do. So you gotta keep moving. And I'm not sure that
there's anything wrong or if there's any actually better approach. But
pushing an unreviewed, complex patch, that originated in a thread
orginally about different relatively small/mundane items, for a
contentious issue, a few days after the initial post. Hm. Not sure how
you'd react if you weren't the author.
Probably not very well. Do you want me to revert it?
If I am off base, please feel free to yell Latin at me again but isn't
this exactly what different trees are for in Git? Would it be possible
to say:
Robert says, "Hey pull XYZ, run ABC tests. They are what the parallelism
fixes do"?
I can't review this patch but I can run a test suite on a number of
platforms and see if it behaves as expected.
albatross around my neck - totally ignored by everyone - until the end
of the last CF, and then the discussion would have gone one of three
ways:
1. Boy, this patch is complicated and I don't understand it. Let's
reject it, even though without it parallel query is trivially broken!
Uh, we'll just let parallel query be broken.
2. Like #1, but we rip out parallel query in its entirety on the eve of beta.
3. Oh well, Robert says we need this, I guess we'd better let him commit it.
4. We need to push the release so we can test this.
I don't find any of those options to be better than the status quo.
If the patch is broken, another two months of having it in the tree give
us a better chance of finding the bugs, especially because, combined
I think this further points to the need for more reviewers and fewer
feature pushes. There are fundamental features that we could use, and this
is one of them. It is certainly more important than, say, pgLogical or BDR
(not that those aren't useful, but we do have external solutions for
that problem).
Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept. Right now, the deadlock.c part of this patch isn't
tested at all by any of our regression test suites, because NOTHING in
deadlock.c is tested by any of our regression test suites. You can
blow it up with dynamite and the regression tests are perfectly happy,
and that's pretty scary.
Test test test. Please commit.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Mon, Feb 8, 2016 at 2:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
If I am off base, please feel free to yell Latin at me again but isn't this
exactly what different trees are for in Git? Would it be possible to say:
Robert says, "Hey pull XYZ, run ABC tests. They are what the parallelism
fixes do"?
I can't review this patch but I can run a test suite on a number of
platforms and see if it behaves as expected.
Sure, I'd love to have the ability to push a branch into the buildfarm
and have the tests get run on all the buildfarm machines and let that
bake for a while before putting it into the main tree. The problem
here is that the complicated part of this patch is something that's
only going to be tested in very rare cases. The simple part of the
patch, which handles the simple-deadlock case, is easy to hit,
although apparently zero people other than Amit and I have found it in
the few months since parallel sequential scan was committed, which
makes me think people haven't tried very hard to break any part of
parallel query, which is a shame. The really hairy part is in
deadlock.c, and it's actually very hard to hit that case. It won't be
hit in real life except in pretty rare circumstances. So testing is
good, but you not only need to know what you are testing but probably
have an automated tool that can run the test a gazillion times in a
loop, or be really clever and find a test case that Amit and I didn't
foresee. And the reality is that getting anybody independent of the
parallel query effort to take an interested in deep testing has not
gone anywhere at all up until now. I'd be happy for that change,
whether because of this commit or for any other reason.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept.
Possibly the reason that wasn't reviewed is that it's not in the
commitfest list (or at least if it is, I sure don't see it).
Having said that, I don't have much of a problem with you pushing it
anyway, unless it will add 15 minutes to make check-world or some such.
regards, tom lane
On 02/08/2016 11:24 AM, Robert Haas wrote:
On Mon, Feb 8, 2016 at 2:00 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
If I am off base, please feel free to yell Latin at me again but isn't this
exactly what different trees are for in Git? Would it be possible to say:
Robert says, "Hey pull XYZ, run ABC tests. They are what the parallelism
fixes do"?
I can't review this patch but I can run a test suite on a number of
platforms and see if it behaves as expected.
Sure, I'd love to have the ability to push a branch into the buildfarm
and have the tests get run on all the buildfarm machines and let that
bake for a while before putting it into the main tree. The problem
here is that the complicated part of this patch is something that's
only going to be tested in very rare cases. The simple part of the
I have no problem running any test cases you wish on a branch in a loop
for the next week and reporting back any errors.
Where this gets tricky is the tooling itself. For me to be able to do so
(and others really) I need to be able to do this:
* Download (preferably a tarball but I can do a git pull)
* Exact instructions on how to set up the tests
* Exact instructions on how to run the tests
* Exact instructions on how to report the tests
If anyone takes the time to do that, I will take the time and resources
to run them.
What I can't do is fiddle around trying to figure out how to set this
stuff up. I don't have the time and it isn't productive for me. I don't
think I am the only one in this boat.
Let's be honest, a lot of people won't even bother to play with this
even though it is easily one of the best features we have coming for 9.6
until we release 9.6.0. That is a bad time to be testing.
The easier we make it for people like me, practitioners to test, the
better it is for the whole project.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Mon, Feb 8, 2016 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
And, by the way, the patch, aside from the deadlock.c portion, was
posted back in October, admittedly without much fanfare, but nobody
reviewed that or any other patch on this thread. If I'd waited for
those reviews to come in, parallel query would not be committed now,
nor probably in 9.6, nor probably in 9.7 or 9.8 either. The whole
project would just be impossible on its face. It would be impossible
in the first instance if I did not have a commit bit, because there is
just not enough committer bandwidth - even reviewer bandwidth more
generally - to review the number of patches that I've submitted
related to parallelism, so in the end some, perhaps many, of those are
going to be committed mostly on the strength of my personal opinion
that committing them is better than not committing them. I am going
to have a heck of a lot of egg on my face if it turns out that I've
been too aggressive in pushing this stuff into the tree. But,
basically, the alternative is that we don't get the feature, and I
think the feature is important enough to justify taking some risk.
FWIW, I appreciate your candor. However, I think that you could have
done a better job of making things easier for reviewers, even if that
might not have made an enormous difference. I suspect I would have not
been able to get UPSERT done as a non-committer if it wasn't for the
epic wiki page, that made it at least possible for someone to jump in.
To be more specific, I thought it was really hard to test parallel
sequential scan a few months ago, because there were so many threads
and so many dependencies. I appreciate that we now use git
format-patch patch series for complicated stuff these days, but it's
important to make it clear how everything fits together. That's
actually what I was thinking about when I said we need to be clear on
how things fit together from the CF app patch page, because there
doesn't seem to be a culture of being particular about that, having
good "annotations", etc.
--
Peter Geoghegan
On Mon, Feb 8, 2016 at 2:36 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
I have no problem running any test cases you wish on a branch in a loop for
the next week and reporting back any errors.
Where this gets tricky is the tooling itself. For me to be able to do so
(and others really) I need to be able to do this:
* Download (preferably a tarball but I can do a git pull)
* Exact instructions on how to set up the tests
* Exact instructions on how to run the tests
* Exact instructions on how to report the tests
If anyone takes the time to do that, I will take the time and resources to
run them.
Well, what I've done is push into the buildfarm code that will allow
us to do *the most exhaustive* testing that I know how to do in an
automated fashion. Which is to create a file that says this:
force_parallel_mode=regress
max_parallel_degree=2
And then run this: make check-world TEMP_CONFIG=/path/to/aforementioned/file
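Spelled out end to end, that is roughly the following (the temp file path is only an example; any writable location works):
# write the extra settings into a temporary config file
cat > /tmp/parallel-test.conf <<'EOF'
force_parallel_mode = regress
max_parallel_degree = 2
EOF
# run the whole test suite with those settings applied to every test cluster
make check-world TEMP_CONFIG=/tmp/parallel-test.conf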
Now, that is not going to find bugs in the deadlock.c portion of the
group locking patch, but it's been wildly successful in finding bugs
in other parts of the parallelism code, and there might well be a few
more that we haven't found yet, which is why I'm hoping that we'll get
this procedure running regularly either on all buildfarm machines, or
on some subset of them, or on new animals that just do this.
Testing the deadlock.c changes is harder. I don't know of a good way
to do it in an automated fashion, which is why I also posted the test
code Amit devised which allows construction of manual test cases.
Constructing a manual test case is *hard* but doable. I think it
would be good to automate this and if somebody's got a good idea about
how to fuzz test it I think that would be *great*. But that's not
easy to do. We haven't had any testing at all of the deadlock
detector up until now, but somehow the deadlock detector itself has
been in the tree for a very long time...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept. Right now, the deadlock.c part of this patch isn't
tested at all by any of our regression test suites, because NOTHING in
deadlock.c is tested by any of our regression test suites. You can
blow it up with dynamite and the regression tests are perfectly happy,
and that's pretty scary.
FWIW a couple of months back I thought you had already pushed that one
and was surprised to find that you hadn't. +1 from me on pushing it.
(I don't mean specifically the deadlock tests, but rather the
isolationtester changes that allowed you to have multiple blocked
backends.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 8, 2016 at 2:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
FWIW, I appreciate your candor. However, I think that you could have
done a better job of making things easier for reviewers, even if that
might not have made an enormous difference. I suspect I would have not
been able to get UPSERT done as a non-committer if it wasn't for the
epic wiki page, that made it at least possible for someone to jump in.
I'm not going to argue with the proposition that it could have been
done better. Equally, I'm going to disclaim having the ability to
have done it better. I've been working on this for three years, and
most of the work that I've put into it has gone into tinkering with C
code that was not in any way user-testable. I've modified essentially
every major component of the system. We had a shared memory facility;
I built another one. We had background workers; I overhauled them. I
invented a message queueing system, and then layered a modified
version of the FE/BE protocol on top of that message queue, and then
later layered tuple-passing on top of that same message queue and then
invented a bespoke protocol that is used to handle typemod mapping.
We had a transaction system; I made substantial, invasive
modifications to it. I tinkered with the GUC subsystem, the combocid
system, and the system for loading loadable modules. Amit added read
functions to a whole class of nodes that never had them before and
together we overhauled core pieces of the executer machinery. Then I
hit the planner with hammer. Finally there's this patch, which
affects heavyweight locking and deadlock detection. I don't believe
that during the time I've been involved with this project anyone else
has ever attempted a project that required changing as many subsystems
as this one did - in some cases rather lightly, but in a number of
cases in pretty significant, invasive ways. No other project in
recent memory has been this invasive to my knowledge. Hot Standby
probably comes closest, but I think (admittedly being much closer to
this work than I was to that work) that this has its fingers in more
places. So, there may be a person who knows how to do all of that
work and get it done in a reasonable time frame and also knows how to
make sure that everybody has the opportunity to be as involved in the
process as they want to be and that there are no bugs or controversial
design decisions, but I am not that person. I am doing my best.
To be more specific, I thought it was really hard to test parallel
sequential scan a few months ago, because there were so many threads
and so many dependencies. I appreciate that we now use git
format-patch patch series for complicated stuff these days, but it's
important to make it clear how everything fits together. That's
actually what I was thinking about when I said we need to be clear on
how things fit together from the CF app patch page, because there
doesn't seem to be a culture of being particular about that, having
good "annotations", etc.
I agree that you had to be pretty deeply involved in that thread to
follow everything that was going on. But it's not entirely fair to
say that it was impossible for anyone else to get involved. Both
Amit and I, mostly Amit, posted directions at various times saying:
here is the sequence of patches that you currently need to apply as of
this time. There was not a heck of a lot of evidence that anyone was
doing that, though, although I think a few people did, and towards the
end things changed very quickly as I committed patches in the series.
We certainly knew what each other were doing and not because of some
hidden off-list collaboration that we kept secret from the community -
we do talk every week, but almost all of our correspondence on those
patches was on-list.
I think it's an inherent peril of complicated patch sets that people
who are not intimately involved in what is going on will have trouble
following just because it takes a lot of work. Is anybody here
following what is going on on the postgres_fdw join pushdown thread?
There's only one patch to apply there right now (though there have
been as many as four at times in the past) and the people who are
actually working on it can follow along, but I'm not a bit surprised
if other people feel lost. It's hard to think that the cause of that
is anything other than "it's hard to find the time to get invested in
a patch that other people are already working hard and apparently
diligently on, especially if you're not personally interested in
seeing that patch get committed, but sometimes even if you are". For
example, I really want the work Fabien and Andres are doing on the
checkpointer to get committed this release. I am reading the emails,
but I haven't tried the patches and I probably won't. I don't have
time to be that involved in every patch. I'm trusting that whatever
Andres commits - which will probably be a whole lot more complex than
what Fabien initially did - will be the right thing to commit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 8, 2016 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
So, there may be a person who knows how to do all of that
work and get it done in a reasonable time frame and also knows how to
make sure that everybody has the opportunity to be as involved in the
process as they want to be and that there are no bugs or controversial
design decisions, but I am not that person. I am doing my best.
To be more specific, I thought it was really hard to test parallel
sequential scan a few months ago, because there were so many threads
and so many dependencies. I appreciate that we now use git
format-patch patch series for complicated stuff these days, but it's
important to make it clear how everything fits together. That's
actually what I was thinking about when I said we need to be clear on
how things fit together from the CF app patch page, because there
doesn't seem to be a culture of being particular about that, having
good "annotations", etc.
I agree that you had to be pretty deeply involved in that thread to
follow everything that was going on. But it's not entirely fair to
say that it was impossible for anyone else to get involved.
All that I wanted to do was look at EXPLAIN ANALYZE output that showed
a parallel seq scan on my laptop, simply because I wanted to see a
cool thing happen. I had to complain about it [1] to get clarification
from you [2].
I accept that this might have been a somewhat isolated incident (that
I couldn't easily get *at least* a little instant gratification), but
it still should be avoided. You've accused me of burying the lead
plenty of times. Don't tell me that it was too hard to prominently
place those details somewhere where I or any other contributor could
reasonably expect to find them, like the CF app page, or a wiki page
that is maintained on an ongoing basis (and linked to at the start of
each thread). If I said that that was too much to you, you'd probably
shout at me. If I persisted, you wouldn't commit my patch, and for me
that probably means it's DOA.
I don't think I'm asking for much here.
[1]: /messages/by-id/CAM3SWZSefE4uQk3r_3gwpfDWWtT3P51SceVsL4=g8v_mE2Abtg@mail.gmail.com
[2]: /messages/by-id/CA+TgmoartTF8eptBhiNwxUkfkctsFc7WtZFhGEGQywE8e2vCmg@mail.gmail.com
--
Peter Geoghegan
On 02/08/2016 01:11 PM, Peter Geoghegan wrote:
On Mon, Feb 8, 2016 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I accept that this might have been a somewhat isolated incident (that
I couldn't easily get *at least* a little instant gratification), but
it still should be avoided. You've accused me of burying the lead
plenty of times. Don't tell me that it was too hard to prominently
place those details somewhere where I or any other contributor could
reasonably expect to find them, like the CF app page, or a wiki page
that is maintained on an ongoing basis (and linked to at the start of
each thread). If I said that that was too much to you, you'd probably
shout at me. If I persisted, you wouldn't commit my patch, and for me
that probably means it's DOA.
I don't think I'm asking for much here.
[1] /messages/by-id/CAM3SWZSefE4uQk3r_3gwpfDWWtT3P51SceVsL4=g8v_mE2Abtg@mail.gmail.com
[2] /messages/by-id/CA+TgmoartTF8eptBhiNwxUkfkctsFc7WtZFhGEGQywE8e2vCmg@mail.gmail.com
This part of the thread seems like something that should be a new thread
about how to write patches. I agree that patches that are large features
that are discussed in depth on a maintained wiki page would be awesome.
Creating that knowledge base without having to troll through code would
be priceless in value.
Sincerely,
JD
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Mon, Feb 8, 2016 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
All that I wanted to do was look at EXPLAIN ANALYZE output that showed
a parallel seq scan on my laptop, simply because I wanted to see a
cool thing happen. I had to complain about it [1] to get clarification
from you [2].
I accept that this might have been a somewhat isolated incident (that
I couldn't easily get *at least* a little instant gratification), but
it still should be avoided. You've accused me of burying the lead
plenty of times. Don't tell me that it was too hard to prominently
place those details somewhere where I or any other contributor could
reasonably expect to find them, like the CF app page, or a wiki page
that is maintained on an ongoing basis (and linked to at the start of
each thread). If I said that that was too much to you, you'd probably
shout at me. If I persisted, you wouldn't commit my patch, and for me
that probably means it's DOA.
I don't think I'm asking for much here.
I don't think you are asking for too much; what I think is that Amit
and I were trying to do exactly the thing you asked for, and mostly
did. On March 20th, Amit posted version 11 of the sequential scan
patch, and included directions about the order in which to apply the
patches:
/messages/by-id/CAA4eK1JSSonzKSN=L-DWuCEWdLqkbMUjvfpE3fGW2tn2zPo2RQ@mail.gmail.com
On March 25th, Amit posted version 12 of the sequential scan patch,
and again included directions about which patches to apply:
/messages/by-id/CAA4eK1L50Y0Y1OGt_DH2eOUyQ-rQCnPvJBOon2PcGjq+1byi4w@mail.gmail.com
On March 27th, Amit posted version 13 of the sequential scan patch,
which did not include those directions:
/messages/by-id/CAA4eK1LFR8sR9viUpLPMKRqUVcRhEFDjSz1019rpwgjYftrXeQ@mail.gmail.com
While perhaps Amit might have included directions again, I think it's
pretty reasonable that he felt that it might not be entirely necessary
to do so given that he had already done it twice in the last week.
This was still the state of affairs when you asked your question on
April 20th. Two days after you asked that question, Amit posted
version 14 of the patch, and again included directions about what
patches to apply:
/messages/by-id/CAA4eK1JLv+2y1AwjhsQPFisKhBF7jWF_Nzirmzyno9uPBRCpGw@mail.gmail.com
Far from the negligence that you seem to be implying, I think Amit was
remarkably diligent about providing these kinds of updates. I
admittedly didn't duplicate those same updates on the parallel
mode/contexts thread to which you replied, but that's partly because I
would often whack around that patch first and then Amit would adjust
his patch to cope with my changes after the fact. That doesn't seem
to have been the case in this particular example, but if this is the
closest thing you can come up with to a process failure during the
development of parallel query, I'm not going to be sad about it: I'm
going to have a beer. Seriously: we worked really hard at this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 8, 2016 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Far from the negligence that you seem to be implying, I think Amit was
remarkably diligent about providing these kinds of updates.
I don't think I remotely implied negligence. That word has very severe
connotations (think "criminal negligence") that are far from what I
intended.
I admittedly didn't duplicate those same updates on the parallel
mode/contexts thread to which you replied, but that's partly because I
would often whack around that patch first and then Amit would adjust
his patch to cope with my changes after the fact. That doesn't seem
to have been the case in this particular example, but if this is the
closest thing you can come up with to a process failure during the
development of parallel query, I'm not going to be sad about it: I'm
going to have a beer. Seriously: we worked really hard at this.
I don't want to get stuck on that one example, which I acknowledged
might not be representative when I raised it. I'm not really talking
about parallel query in particular anyway. I'm mostly arguing for a
consistent way to get instructions on how to at least build the patch,
where that might be warranted.
The CF app is one way. Another good way is: As long as we're using a
patch series, be explicit about what goes where in the commit message.
Have message-id references. That sort of thing. I already try to do
that. That's all.
Thank you (and Amit) for working really hard on parallelism.
--
Peter Geoghegan
On Mon, Feb 8, 2016 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Feb 8, 2016 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Far from the negligence that you seem to be implying, I think Amit was
remarkably diligent about providing these kinds of updates.
I don't think I remotely implied negligence. That word has very severe
connotations (think "criminal negligence") that are far from what I
intended.
OK, sorry, I think I misread your tone.
I don't want to get stuck on that one example, which I acknowledged
might not be representative when I raised it. I'm not really talking
about parallel query in particular anyway. I'm mostly arguing for a
consistent way to get instructions on how to at least build the patch,
where that might be warranted.
The CF app is one way. Another good way is: As long as we're using a
patch series, be explicit about what goes where in the commit message.
Have message-id references. That sort of thing. I already try to do
that. That's all.
Yeah, me too. Generally, although with some exceptions, my practice
is to keep reposting the whole patch stack, so that everything is in
one email. In this particular case, though, there were patches from
me and patches from Amit, so that was harder to do. I wasn't using
his patches to test my patches; I had other test code for that. He
was using my patches as a base for his patches, but linked to them
instead of reposting them. That's an unusually complicated scenario,
though: it's pretty rare around here to have two developers working
together on something as closely as Amit and I did on those patches.
Thank you (and Amit) for working really hard on parallelism.
Thanks.
By the way, it bears saying, or if I've said it before repeating, that
although most of the parallelism code that has been committed was
written by me, Amit has made an absolutely invaluable contribution to
parallel query, and it wouldn't be committed today or maybe ever
without that contribution. In addition to those parts of the code
that were committed as he wrote them, he prototyped quite a number of
things that I ended up rewriting, reviewed a ton of code that I wrote
and found bugs in it, wrote numerous bits and pieces of test code, and
generally put up with an absolutely insane level of me nitpicking his
work, breaking it by committing pieces of it or committing different
pieces that replaced pieces he had, demanding repeated rebases on
short time scales, and generally beating him up in just about every
conceivable way. I am deeply appreciative of him being willing to
jump into this project, do a ton of work, and put up with me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,
On 2016-02-08 13:45:37 -0500, Robert Haas wrote:
I realize that this stuff has all been brewing long, and that there's
still a lot to do. So you gotta keep moving. And I'm not sure that
there's anything wrong or if there's any actually better approach. But
pushing an unreviewed, complex patch, that originated in a thread
originally about different relatively small/mundane items, for a
contentious issue, a few days after the initial post. Hm. Not sure how
you'd react if you weren't the author.
Probably not very well. Do you want me to revert it?
No. I want(ed) to express that I am not comfortable with how this got
in. My aim wasn't to generate a flurry of responses with everybody
piling on, or anything like that. But it's unfortunately hard to
avoid. I wish I knew a way, besides only sending private mails. Which I
don't think is a great approach either.
I do agree that we need something to tackle this problem, and that this
quite possibly is the least bad way to do this. And certainly the only
one that's been implemented and posted with any degree of completeness.
But even given the last paragraph, posting a complex new patch in a
somewhat related thread, and then pushing it 5 days later is pretty darn
quick.
I mean, look. [explanation why we need the infrastructure]. Do I really
think anybody was going to spend the time to understand deadlock.c
well enough to verify my changes? No, I don't. What I think would
have happened is that the patch would have sat around like an
albatross around my neck - totally ignored by everyone - until the end
of the last CF, and then the discussion would have gone one of three
ways:
Yes, believe me, I really get that. It's awfully hard to get substantial
review for pieces of code that require a lot of context.
But I think posting this patch in a new thread, posting a message that
you're intending to commit unless somebody protests with a substantial
arguments and/or a timeline of review, and then waiting a few days, are
something that should be done for a major piece of new infrastructure,
especially when it's somewhat controversial.
This doesn't just affect parallel execution, it affects one of least
understood parts of postgres code. And where hard to find bugs, likely
to only trigger in production, are to be expected.
And, by the way, the patch, aside from the deadlock.c portion, was
posted back in October, admittedly without much fanfare, but nobody
reviewed that or any other patch on this thread.
I think it's unrealistic to expect random patches without a commitfest
entry, posted somewhere deep in a thread, to get a review when there are
so many open commitfest entries that haven't gotten feedback, and which
we are supposed to look at.
If I'd waited for those reviews to come in, parallel query would not
be committed now, nor probably in 9.6, nor probably in 9.7 or 9.8
either. The whole project would just be impossible on its face.
Yes, that's a problem. But you're not the only one facing it, and you've
argued hard against such an approach in some other cases.
I think it's myopic to say "well, but this patch might have bugs".
Very true. But also, all the other parallelism patches that are
already committed or that are still under review but which can't be
properly tested without this patch might have bugs, too, so you've got
to weigh the risk that this patch might get better if I wait longer to
commit it against the possibility that not having committed it reduces
the chances of finding bugs elsewhere. I don't want it to seem like
I'm forcing this down the community's throat - I don't have a right to
do that, and I will certainly revert this patch if that is the
consensus. But that is not what I think best myself. What I think
would be better is to (1) make an effort to get the buildfarm testing
which this patch enables up and running as soon as possible and (2)
for somebody to read over the committed code and raise any issues that
they find. Or, for that matter, to read over the committed code for
any of the *other* parallelism patches and raise any issues that they
find with *that* code. There's certainly scads of code here and this
is far from the only bit that might have bugs.
I think you are, and *you have to*, walk a very thin line here. I agree
that realistically there's just nobody with the bandwidth to keep up
with a fully loaded Robert. Not without ignoring their own stuff at
least. And I think the importance of what you're building means we need
to be flexible. But I think that thin line in turn means that you have
to be *doubly* careful about communication. I.e. post new infrastructure
to new threads, "warn" that you're intending to commit something
potentially needing debate/review, etc.
Oh: another thing that I would like to do is commit the isolation
tests I wrote for the deadlock detector a while back, which nobody has
reviewed either, though Tom and Alvaro seemed reasonably positive
about the concept.
I think adding new regression tests should have a barrier to commit
that's about two magnitudes lower than something like group locks. I
mean the worst that they could do is to flap around for some reason, or
take a bit too long. So please please go ahead.
Greetings,
Andres Freund
On 2016-02-08 15:18:13 -0500, Robert Haas wrote:
I agree that you had to be pretty deeply involved in that thread to
follow everything that was going on. But it's not entirely fair to
say that it was impossible for anyone else to get involved. Both
Amit and I, mostly Amit, posted directions at various times saying:
here is the sequence of patches that you currently need to apply as of
this time. There was not a heck of a lot of evidence that anyone was
doing that, though, though I think a few people did, and towards the
end things changed very quickly as I committed patches in the series.
We certainly knew what each other were doing and not because of some
hidden off-list collaboration that we kept secret from the community -
we do talk every week, but almost all of our correspondence on those
patches was on-list.
I think having a public git tree, that contains the current state, is
greatly helpful for that. Just announce that you're going to screw
wildly with history, and that you're not going to be terribly careful
about commit messages. That means observers can just do a fetch and a
reset --hard to see the absolutely latest and greatest. By all means
post a series to the list every now and then, but I think for minor
changes it's perfectly sane to say 'pull to see the fixups for the
issues you noticed'.
I think it's an inherent peril of complicated patch sets that people
who are not intimately involved in what is going on will have trouble
following just because it takes a lot of work.
True. But it becomes doubly hard if there's no up-to-date high level
design overview somewhere outside $sizeable_brain. I know it sucks to
write these, believe me. Especially because one definitely feels that
nobody is reading those.
Greetings,
Andres Freund
On Mon, Feb 8, 2016 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
I think having a public git tree, that contains the current state, is
greatly helpful for that. Just announce that you're going to screw
wildly with history, and that you're not going to be terribly careful
about commit messages. That means observers can just do a fetch and a
reset --hard to see the absolutely latest and greatest. By all means
post a series to the list every now and then, but I think for minor
changes it's perfectly sane to say 'pull to see the fixups for the
issues you noticed'.
I would really like for there to be a way to do that more often. It
would be a significant time saver, because it removes problems with
minor bitrot.
--
Peter Geoghegan
On Mon, Feb 8, 2016 at 5:27 PM, Andres Freund <andres@anarazel.de> wrote:
contentious issue, a few days after the initial post. Hm. Not sure how
you'd react if you weren't the author.
Probably not very well. Do you want me to revert it?
No. I want(ed) to express that I am not comfortable with how this got
in. My aim wasn't to generate a flurry of responses with everybody
piling on, or anything like that. But it's unfortunately hard to
avoid. I wish I knew a way, besides only sending private mails. Which I
don't think is a great approach either.
I do agree that we need something to tackle this problem, and that this
quite possibly is the least bad way to do this. And certainly the only
one that's been implemented and posted with any degree of completeness.
But even given the last paragraph, posting a complex new patch in a
somewhat related thread, and then pushing it 5 days later is pretty darn
quick.
Sorry. I understand your discomfort, and you're probably right. I'll
try to handle it better next time. I think my frustration with the
process got the better of me a little bit here. This patch may very
well not be perfect, but it's sure as heck better than doing nothing,
and if I'd gone out of my way to say "hey, everybody, here's a patch
that you might want to object to" I'm sure I could have found some
volunteers to do just that. But, you know, that's not really what I
want. What I want is somebody to do a detailed review and help me fix
whatever the problems the patch may have. And ideally, I'd like that
person to understand that you can't have parallel query without doing
something in this area - which I think you do, but certainly not
everybody probably did - and that a lot of simplistic, non-invasive
ideas for how to handle this are going to be utterly inadequate in
complex cases. Unless you or Noah want to take a hand, I don't expect
to get that sort of review. Now, that having been said, I think your
frustration with the way I handled it is somewhat justified, and since
you are not arguing for a revert I'm not sure what I can do except try
not to let my frustration get in the way next time. Which I will try
to do.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi!
Thanks for the answer. Sounds good.
On 2016-02-08 18:47:18 -0500, Robert Haas wrote:
and if I'd gone out of my way to say "hey, everybody, here's a patch
that you might want to object to" I'm sure I could have found some
volunteers to do just that. But, you know, that's not really what I
want.
Sometimes I wonder if three shooting-from-the-hip answers shouldn't cost
a jog around the block or such (of which I'm sometimes guilty as
well!). Wouldn't just help the on-list volume, but also our collective
health ;)
Unless you or Noah want to take a hand, I don't expect to get that
sort of review. Now, that having been said, I think your frustration
with the way I handled it is somewhat justified, and since you are not
arguing for a revert I'm not sure what I can do except try not to let
my frustration get in the way next time. Which I will try to do.
FWIW, I do hope to put more time into reviewing parallelism stuff in the
coming weeks. It's hard to balance all that one likes to do.
- Andres
On Mon, Feb 08, 2016 at 02:49:27PM -0500, Robert Haas wrote:
Well, what I've done is push into the buildfarm code that will allow
us to do *the most exhaustive* testing that I know how to do in an
automated fashion. Which is to create a file that says this:force_parallel_mode=regress
max_parallel_degree=2And then run this: make check-world TEMP_CONFIG=/path/to/aforementioned/file
Now, that is not going to find bugs in the deadlock.c portion of the
group locking patch, but it's been wildly successful in finding bugs
in other parts of the parallelism code, and there might well be a few
more that we haven't found yet, which is why I'm hoping that we'll get
this procedure running regularly either on all buildfarm machines, or
on some subset of them, or on new animals that just do this.
I configured a copy of animal "mandrill" that way and launched a test run.
The postgres_fdw suite failed as attached. A manual "make -C contrib
installcheck" fails the same way on a ppc64 GNU/Linux box, but it passes on
x86_64 and aarch64. Since contrib test suites don't recognize TEMP_CONFIG,
check-world passes everywhere.
Attachments:
contrib-install-check-C.log.gz (application/x-gunzip)