sinval synchronization considered harmful
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)
[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)
[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse. These are 5-minute runs:
[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)
I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter. The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison. I just got rid of that completely, by widening the
counters to 64 bits. Assuming I haven't botched the implementation,
this is clearly a win. There is still some performance drop-off going
from 32 clients to 80 clients, but it is much less.
[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgnumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
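To make the intended ordering concrete, here is a rough sketch of the
read side (not the attached patch itself; pg_compiler_barrier() is a
made-up placeholder for whatever compiler-barrier primitive we settle
on, and the other names are simplified):
/* Sketch only: assumes x86-style ordering, so all we have to do is
 * keep the compiler from rearranging the three steps. */
uint64  max;
int     n = 0;

max = segP->maxMsgNum;          /* (1) read the counter first */
pg_compiler_barrier();

while (stateP->nextMsgNum < max && n < datasize)
{
    /* (2) then read the messages the counter says are present */
    data[n++] = segP->buffer[stateP->nextMsgNum++ % MAXNUMMESSAGES];
}

pg_compiler_barrier();
if (stateP->resetState)         /* (3) finally recheck resetState */
{
    /* we fell too far behind and messages were overwritten, so
     * anything read above may be garbage; discard it */
    stateP->nextMsgNum = max;
    stateP->resetState = false;
    return -1;
}
return n;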
Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place. But here are
the results on x86:
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Now, that's actually *faster* than the above 32-client numbers. Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients. It appears that even one spinlock acquisition
in SIGetDataEntries() is too many. Currently, we have *three*: one to
get SInvalReadLock, one to get msgnumLock, and one to release
SInvalReadLock.
For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if there
is one. But on the 32-core machine, the differences are dramatic.
Thoughts? Comments? Ideas?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
sinval-per-backend-mutex.patch
*** a/src/backend/storage/ipc/sinvaladt.c
--- b/src/backend/storage/ipc/sinvaladt.c
***************
*** 70,105 ****
* that aren't stuck will propagate their interrupts to the next guy.
*
* We would have problems if the MsgNum values overflow an integer, so
! * whenever minMsgNum exceeds MSGNUMWRAPAROUND, we subtract MSGNUMWRAPAROUND
! * from all the MsgNum variables simultaneously. MSGNUMWRAPAROUND can be
! * large so that we don't need to do this often. It must be a multiple of
! * MAXNUMMESSAGES so that the existing circular-buffer entries don't need
! * to be moved when we do it.
*
! * Access to the shared sinval array is protected by two locks, SInvalReadLock
! * and SInvalWriteLock. Readers take SInvalReadLock in shared mode; this
! * authorizes them to modify their own ProcState but not to modify or even
! * look at anyone else's. When we need to perform array-wide updates,
! * such as in SICleanupQueue, we take SInvalReadLock in exclusive mode to
! * lock out all readers. Writers take SInvalWriteLock (always in exclusive
! * mode) to serialize adding messages to the queue. Note that a writer
! * can operate in parallel with one or more readers, because the writer
! * has no need to touch anyone's ProcState, except in the infrequent cases
! * when SICleanupQueue is needed. The only point of overlap is that
! * the writer wants to change maxMsgNum while readers need to read it.
! * We deal with that by having a spinlock that readers must take for just
! * long enough to read maxMsgNum, while writers take it for just long enough
! * to write maxMsgNum. (The exact rule is that you need the spinlock to
! * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
! * spinlock to write maxMsgNum unless you are holding both locks.)
*
! * Note: since maxMsgNum is an int and hence presumably atomically readable/
! * writable, the spinlock might seem unnecessary. The reason it is needed
! * is to provide a memory barrier: we need to be sure that messages written
! * to the array are actually there before maxMsgNum is increased, and that
! * readers will see that data after fetching maxMsgNum. Multiprocessors
! * that have weak memory-ordering guarantees can fail without the memory
! * barrier instructions that are included in the spinlock sequences.
*/
--- 70,101 ----
* that aren't stuck will propagate their interrupts to the next guy.
*
* We would have problems if the MsgNum values overflow an integer, so
! * we use a 64-bit counter to make sure that doesn't happen. (Of course,
! * technically you can overflow anything... but as we only support 2^64 bytes
! * of write-ahead log over the lifetime of the cluster, assuming that we won't
! * have more sinval messages than that during one postmaster session seems
! * pretty safe. At 10 million messages per second, it would take more than
! * fifty-eight thousand years to overflow.)
*
! * Writers take SInvalWriteLock (always in exclusive mode) to serialize adding
! * messages to the queue. Note that a writer can operate in parallel with one
! * or more readers (who only need to take the spinlocks that protect their own
! * state), because the writer has no need to touch anyone's ProcState, except
! * in the infrequent cases when SICleanupQueue is needed. The only point of
! * overlap is that the writer wants to change maxMsgNum while readers need to
! * read it. We deal with that by having a spinlock that readers must take for
! * just long enough to read maxMsgNum, while writers take it for just long
! * enough to write maxMsgNum. (The exact rule is that you need the spinlock to
! * read maxMsgNum if you are not holding SInvalWriteLock, and you must always
! * hold the spinlock to write maxMsgNum.)
*
! * Note: since maxMsgNum is presumably atomically readable/writable, the
! * spinlock might seem unnecessary. The reason it is needed is to provide a
! * memory barrier: we need to be sure that messages written to the array are
! * actually there before maxMsgNum is increased, and that readers will see
! * that data after fetching maxMsgNum. Multiprocessors that have weak
! * memory-ordering guarantees can fail without the memory barrier instructions
! * that are included in the spinlock sequences.
*/
***************
*** 109,117 ****
* MAXNUMMESSAGES: max number of shared-inval messages we can buffer.
* Must be a power of 2 for speed.
*
- * MSGNUMWRAPAROUND: how often to reduce MsgNum variables to avoid overflow.
- * Must be a multiple of MAXNUMMESSAGES. Should be large.
- *
* CLEANUP_MIN: the minimum number of messages that must be in the buffer
* before we bother to call SICleanupQueue.
*
--- 105,110 ----
***************
*** 128,134 ****
*/
#define MAXNUMMESSAGES 4096
- #define MSGNUMWRAPAROUND (MAXNUMMESSAGES * 262144)
#define CLEANUP_MIN (MAXNUMMESSAGES / 2)
#define CLEANUP_QUANTUM (MAXNUMMESSAGES / 16)
#define SIG_THRESHOLD (MAXNUMMESSAGES / 2)
--- 121,126 ----
***************
*** 139,146 **** typedef struct ProcState
{
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
! int nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
--- 131,139 ----
{
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
+ slock_t mutex; /* protects nextMsgNum, resetState, signaled */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
! uint64 nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
***************
*** 167,175 **** typedef struct SISeg
/*
* General state information
*/
! int minMsgNum; /* oldest message still needed */
! int maxMsgNum; /* next message number to be assigned */
! int nextThreshold; /* # of messages to call SICleanupQueue */
int lastBackend; /* index of last active procState entry, +1 */
int maxBackends; /* size of procState array */
--- 160,168 ----
/*
* General state information
*/
! uint64 minMsgNum; /* oldest message still needed */
! uint64 maxMsgNum; /* next message number to be assigned */
! uint64 nextThreshold; /* # of messages to call SICleanupQueue */
int lastBackend; /* index of last active procState entry, +1 */
int maxBackends; /* size of procState array */
***************
*** 245,250 **** CreateSharedInvalidationState(void)
--- 238,244 ----
for (i = 0; i < shmInvalBuffer->maxBackends; i++)
{
shmInvalBuffer->procState[i].procPid = 0; /* inactive */
+ SpinLockInit(&shmInvalBuffer->procState[i].mutex);
shmInvalBuffer->procState[i].nextMsgNum = 0; /* meaningless */
shmInvalBuffer->procState[i].resetState = false;
shmInvalBuffer->procState[i].signaled = false;
***************
*** 267,274 **** SharedInvalBackendInit(bool sendOnly)
* This can run in parallel with read operations, and for that matter with
* write operations; but not in parallel with additions and removals of
* backends, nor in parallel with SICleanupQueue. It doesn't seem worth
! * having a third lock, so we choose to use SInvalWriteLock to serialize
! * additions/removals.
*/
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
--- 261,267 ----
* This can run in parallel with read operations, and for that matter with
* write operations; but not in parallel with additions and removals of
* backends, nor in parallel with SICleanupQueue. It doesn't seem worth
! * having another lock, so we just use SInvalWriteLock.
*/
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
***************
*** 415,421 **** SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
while (n > 0)
{
int nthistime = Min(n, WRITE_QUANTUM);
! int numMsgs;
int max;
n -= nthistime;
--- 408,414 ----
while (n > 0)
{
int nthistime = Min(n, WRITE_QUANTUM);
! uint64 numMsgs;
int max;
n -= nthistime;
***************
*** 479,495 **** SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
* NB: this can run in parallel with other instances of SIGetDataEntries
* executing on behalf of other backends, since each instance will modify only
* fields of its own backend's ProcState, and no instance will look at fields
! * of other backends' ProcStates. We express this by grabbing SInvalReadLock
! * in shared mode. Note that this is not exactly the normal (read-only)
! * interpretation of a shared lock! Look closely at the interactions before
! * allowing SInvalReadLock to be grabbed in shared mode for any other reason!
*
* NB: this can also run in parallel with SIInsertDataEntries. It is not
* guaranteed that we will return any messages added after the routine is
* entered.
- *
- * Note: we assume that "datasize" is not so large that it might be important
- * to break our hold on SInvalReadLock into segments.
*/
int
SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
--- 472,482 ----
* NB: this can run in parallel with other instances of SIGetDataEntries
* executing on behalf of other backends, since each instance will modify only
* fields of its own backend's ProcState, and no instance will look at fields
! * of other backends' ProcStates.
*
* NB: this can also run in parallel with SIInsertDataEntries. It is not
* guaranteed that we will return any messages added after the routine is
* entered.
*/
int
SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
***************
*** 499,506 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
int max;
int n;
- LWLockAcquire(SInvalReadLock, LW_SHARED);
-
segP = shmInvalBuffer;
stateP = &segP->procState[MyBackendId - 1];
--- 486,491 ----
***************
*** 514,519 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
--- 499,514 ----
SpinLockRelease(&vsegP->msgnumLock);
}
+ /*
+ * It's OK to acquire the mutex after reading maxMsgNum. It's possible
+ * that maxMsgNum will have advanced further in the meantime, but since
+ * we only guarantee that we'll read invalidation entries added prior to
+ * entry into this function, that's no problem. On the flip side, if
+ * any messages have been removed from the queue that we still need to
+ * read, SICleanupQueue will have set stateP->resetState.
+ */
+ SpinLockAcquire(&stateP->mutex);
+
if (stateP->resetState)
{
/*
***************
*** 524,530 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
stateP->nextMsgNum = max;
stateP->resetState = false;
stateP->signaled = false;
! LWLockRelease(SInvalReadLock);
return -1;
}
--- 519,525 ----
stateP->nextMsgNum = max;
stateP->resetState = false;
stateP->signaled = false;
! SpinLockRelease(&stateP->mutex);
return -1;
}
***************
*** 549,555 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
if (stateP->nextMsgNum >= max)
stateP->signaled = false;
! LWLockRelease(SInvalReadLock);
return n;
}
--- 544,550 ----
if (stateP->nextMsgNum >= max)
stateP->signaled = false;
! SpinLockRelease(&stateP->mutex);
return n;
}
***************
*** 573,589 **** void
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
! int min,
minsig,
lowbound,
! numMsgs,
! i;
ProcState *needSig = NULL;
! /* Lock out all writers and readers */
if (!callerHasWriteLock)
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
- LWLockAcquire(SInvalReadLock, LW_EXCLUSIVE);
/*
* Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
--- 568,583 ----
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
! uint64 min,
minsig,
lowbound,
! numMsgs;
! int i;
ProcState *needSig = NULL;
! /* Lock out writers */
if (!callerHasWriteLock)
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
/*
* Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
***************
*** 599,649 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
for (i = 0; i < segP->lastBackend; i++)
{
ProcState *stateP = &segP->procState[i];
! int n = stateP->nextMsgNum;
! /* Ignore if inactive or already in reset state */
! if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly)
continue;
/*
* If we must free some space and this backend is preventing it, force
* him into reset state and then ignore until he catches up.
*/
if (n < lowbound)
{
stateP->resetState = true;
/* no point in signaling him ... */
- continue;
}
!
! /* Track the global minimum nextMsgNum */
! if (n < min)
! min = n;
!
! /* Also see who's furthest back of the unsignaled backends */
! if (n < minsig && !stateP->signaled)
{
! minsig = n;
! needSig = stateP;
}
- }
- segP->minMsgNum = min;
! /*
! * When minMsgNum gets really large, decrement all message counters so as
! * to forestall overflow of the counters. This happens seldom enough that
! * folding it into the previous loop would be a loser.
! */
! if (min >= MSGNUMWRAPAROUND)
! {
! segP->minMsgNum -= MSGNUMWRAPAROUND;
! segP->maxMsgNum -= MSGNUMWRAPAROUND;
! for (i = 0; i < segP->lastBackend; i++)
! {
! /* we don't bother skipping inactive entries here */
! segP->procState[i].nextMsgNum -= MSGNUMWRAPAROUND;
! }
}
/*
* Determine how many messages are still in the queue, and set the
--- 593,642 ----
for (i = 0; i < segP->lastBackend; i++)
{
ProcState *stateP = &segP->procState[i];
! int n;
!
! /* Ignore if inactive or send-only */
! if (stateP->procPid == 0 || stateP->sendOnly)
! continue;
! /* Must acquire mutex before examining any more state */
! SpinLockAcquire(&stateP->mutex);
!
! /* Ignore if already reset */
! if (stateP->resetState)
! {
! SpinLockRelease(&stateP->mutex);
continue;
+ }
/*
* If we must free some space and this backend is preventing it, force
* him into reset state and then ignore until he catches up.
*/
+ n = stateP->nextMsgNum;
if (n < lowbound)
{
stateP->resetState = true;
/* no point in signaling him ... */
}
! else
{
! /* Track the global minimum nextMsgNum */
! if (n < min)
! min = n;
!
! /* Also see who's furthest back of the unsignaled backends */
! if (n < minsig && !stateP->signaled)
! {
! minsig = n;
! needSig = stateP;
! }
}
! /* Done examining state for this backend */
! SpinLockRelease(&stateP->mutex);
}
+ segP->minMsgNum = min;
/*
* Determine how many messages are still in the queue, and set the
***************
*** 665,672 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
pid_t his_pid = needSig->procPid;
BackendId his_backendId = (needSig - &segP->procState[0]) + 1;
needSig->signaled = true;
! LWLockRelease(SInvalReadLock);
LWLockRelease(SInvalWriteLock);
elog(DEBUG4, "sending sinval catchup signal to PID %d", (int) his_pid);
SendProcSignal(his_pid, PROCSIG_CATCHUP_INTERRUPT, his_backendId);
--- 658,666 ----
pid_t his_pid = needSig->procPid;
BackendId his_backendId = (needSig - &segP->procState[0]) + 1;
+ SpinLockAcquire(&needSig->mutex);
needSig->signaled = true;
! SpinLockRelease(&needSig->mutex);
LWLockRelease(SInvalWriteLock);
elog(DEBUG4, "sending sinval catchup signal to PID %d", (int) his_pid);
SendProcSignal(his_pid, PROCSIG_CATCHUP_INTERRUPT, his_backendId);
***************
*** 675,681 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
}
else
{
- LWLockRelease(SInvalReadLock);
if (!callerHasWriteLock)
LWLockRelease(SInvalWriteLock);
}
--- 669,674 ----
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 51,57 **** typedef enum LWLockId
OidGenLock,
XidGenLock,
ProcArrayLock,
! SInvalReadLock,
SInvalWriteLock,
WALInsertLock,
WALWriteLock,
--- 51,57 ----
OidGenLock,
XidGenLock,
ProcArrayLock,
! PlaceholderForSInvalReadLock, /* no longer used */
SInvalWriteLock,
WALInsertLock,
WALWriteLock,
sinval-unlocked.patch
*** a/src/backend/storage/ipc/sinvaladt.c
--- b/src/backend/storage/ipc/sinvaladt.c
***************
*** 140,146 **** typedef struct ProcState
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
! int nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
--- 140,146 ----
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
! uint64 nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
***************
*** 167,180 **** typedef struct SISeg
/*
* General state information
*/
! int minMsgNum; /* oldest message still needed */
! int maxMsgNum; /* next message number to be assigned */
! int nextThreshold; /* # of messages to call SICleanupQueue */
int lastBackend; /* index of last active procState entry, +1 */
int maxBackends; /* size of procState array */
- slock_t msgnumLock; /* spinlock protecting maxMsgNum */
-
/*
* Circular buffer holding shared-inval messages
*/
--- 167,178 ----
/*
* General state information
*/
! uint64 minMsgNum; /* oldest message still needed */
! uint64 maxMsgNum; /* next message number to be assigned */
! uint64 nextThreshold; /* # of messages to call SICleanupQueue */
int lastBackend; /* index of last active procState entry, +1 */
int maxBackends; /* size of procState array */
/*
* Circular buffer holding shared-inval messages
*/
***************
*** 237,243 **** CreateSharedInvalidationState(void)
shmInvalBuffer->nextThreshold = CLEANUP_MIN;
shmInvalBuffer->lastBackend = 0;
shmInvalBuffer->maxBackends = MaxBackends;
- SpinLockInit(&shmInvalBuffer->msgnumLock);
/* The buffer[] array is initially all unused, so we need not fill it */
--- 235,240 ----
***************
*** 414,422 **** SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
*/
while (n > 0)
{
! int nthistime = Min(n, WRITE_QUANTUM);
! int numMsgs;
! int max;
n -= nthistime;
--- 411,419 ----
*/
while (n > 0)
{
! uint64 nthistime = Min(n, WRITE_QUANTUM);
! uint64 numMsgs;
! uint64 max;
n -= nthistime;
***************
*** 454,462 **** SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/* use volatile pointer to prevent code rearrangement */
volatile SISeg *vsegP = segP;
- SpinLockAcquire(&vsegP->msgnumLock);
vsegP->maxMsgNum = max;
- SpinLockRelease(&vsegP->msgnumLock);
}
LWLockRelease(SInvalWriteLock);
--- 451,457 ----
***************
*** 495,517 **** int
SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
SISeg *segP;
! ProcState *stateP;
! int max;
int n;
- LWLockAcquire(SInvalReadLock, LW_SHARED);
-
segP = shmInvalBuffer;
stateP = &segP->procState[MyBackendId - 1];
! /* Fetch current value of maxMsgNum using spinlock */
{
/* use volatile pointer to prevent code rearrangement */
volatile SISeg *vsegP = segP;
- SpinLockAcquire(&vsegP->msgnumLock);
max = vsegP->maxMsgNum;
- SpinLockRelease(&vsegP->msgnumLock);
}
if (stateP->resetState)
--- 490,508 ----
SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
SISeg *segP;
! volatile ProcState *stateP;
! uint64 max;
int n;
segP = shmInvalBuffer;
stateP = &segP->procState[MyBackendId - 1];
! /* Fetch current value of maxMsgNum */
{
/* use volatile pointer to prevent code rearrangement */
volatile SISeg *vsegP = segP;
max = vsegP->maxMsgNum;
}
if (stateP->resetState)
***************
*** 524,530 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
stateP->nextMsgNum = max;
stateP->resetState = false;
stateP->signaled = false;
- LWLockRelease(SInvalReadLock);
return -1;
}
--- 515,520 ----
***************
*** 544,555 **** SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
}
/*
* Reset our "signaled" flag whenever we have caught up completely.
*/
if (stateP->nextMsgNum >= max)
stateP->signaled = false;
- LWLockRelease(SInvalReadLock);
return n;
}
--- 534,560 ----
}
/*
+ * Recheck whether we've been reset.
+ */
+ if (stateP->resetState)
+ {
+ /*
+ * Force reset. We can say we have dealt with any messages added
+ * since the reset, as well; and that means we should clear the
+ * signaled flag, too.
+ */
+ stateP->nextMsgNum = max;
+ stateP->resetState = false;
+ stateP->signaled = false;
+ return -1;
+ }
+
+ /*
* Reset our "signaled" flag whenever we have caught up completely.
*/
if (stateP->nextMsgNum >= max)
stateP->signaled = false;
return n;
}
***************
*** 573,589 **** void
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
! int min,
minsig,
lowbound,
! numMsgs,
! i;
ProcState *needSig = NULL;
/* Lock out all writers and readers */
if (!callerHasWriteLock)
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
- LWLockAcquire(SInvalReadLock, LW_EXCLUSIVE);
/*
* Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
--- 578,593 ----
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
! uint64 min,
minsig,
lowbound,
! numMsgs;
! int i;
ProcState *needSig = NULL;
/* Lock out all writers and readers */
if (!callerHasWriteLock)
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
/*
* Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the
***************
*** 630,651 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
segP->minMsgNum = min;
/*
- * When minMsgNum gets really large, decrement all message counters so as
- * to forestall overflow of the counters. This happens seldom enough that
- * folding it into the previous loop would be a loser.
- */
- if (min >= MSGNUMWRAPAROUND)
- {
- segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
- for (i = 0; i < segP->lastBackend; i++)
- {
- /* we don't bother skipping inactive entries here */
- segP->procState[i].nextMsgNum -= MSGNUMWRAPAROUND;
- }
- }
-
- /*
* Determine how many messages are still in the queue, and set the
* threshold at which we should repeat SICleanupQueue().
*/
--- 634,639 ----
***************
*** 666,672 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
BackendId his_backendId = (needSig - &segP->procState[0]) + 1;
needSig->signaled = true;
- LWLockRelease(SInvalReadLock);
LWLockRelease(SInvalWriteLock);
elog(DEBUG4, "sending sinval catchup signal to PID %d", (int) his_pid);
SendProcSignal(his_pid, PROCSIG_CATCHUP_INTERRUPT, his_backendId);
--- 654,659 ----
***************
*** 675,681 **** SICleanupQueue(bool callerHasWriteLock, int minFree)
}
else
{
- LWLockRelease(SInvalReadLock);
if (!callerHasWriteLock)
LWLockRelease(SInvalWriteLock);
}
--- 662,667 ----
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 51,57 **** typedef enum LWLockId
OidGenLock,
XidGenLock,
ProcArrayLock,
- SInvalReadLock,
SInvalWriteLock,
WALInsertLock,
WALWriteLock,
--- 51,56 ----
Robert Haas wrote:
SIGetDataEntries(). I believe we need to bite the bullet and
rewrite this using a lock-free algorithm, using memory barriers on
processors with weak memory ordering.
[32 processors; 80 clients]
On unpatched master
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
With the lazy vxid locks patch
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)
gets rid of SInvalReadLock and instead gives each backend its own
spinlock.
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)
SIGetDataEntries() can pretty easily be made lock-free.
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Thoughts? Comments? Ideas?
Very impressive! Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others. With the number of CPUs climbing as it is, this is very
important work!
-Kevin
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Very impressive! Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others. With the number of CPUs climbing as it is, this is very
important work!
Thanks. I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion. It
seems like there are basically two categories of machines we need to
worry about.
1. Machines with strong memory ordering. On this category of machines
(which include x86), the CPU basically does not perform loads or
stores out of order. On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program. The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization. On x86, certain categories of exotic
instructions (non-temporal stores, for example) do require explicit
fences, but ordinary loads and stores do not.
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and as a memory
fencing instruction on machines with weak memory ordering.
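Concretely, such a primitive might look something like this (a
gcc-flavored sketch only; the name pg_memory_barrier() is made up for
illustration):
#if defined(__i386__) || defined(__x86_64__)
/* strong ordering: just keep the compiler from reordering across this */
#define pg_memory_barrier()     __asm__ __volatile__("" ::: "memory")
#elif defined(__powerpc__) || defined(__ppc__)
/* weak ordering: emit a real fence instruction as well */
#define pg_memory_barrier()     __asm__ __volatile__("sync" ::: "memory")
#else
#error "no barrier implementation for this platform yet"
#endif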
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
regards, tom lane
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
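For illustration, that fallback could be as simple as this sketch
(using the existing spin.h primitives; the function name is made up,
and the dummy lock would need a one-time SpinLockInit() at backend
start):
#include "storage/spin.h"

static slock_t  barrier_dummy_lock;     /* backend-private, never contended */

static void
fallback_memory_barrier(void)
{
    /*
     * All we want are the barrier instructions already baked into the
     * TAS/release sequences; since no other process ever touches this
     * lock, the acquire can never actually spin.
     */
    SpinLockAcquire(&barrier_dummy_lock);
    SpinLockRelease(&barrier_dummy_lock);
}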
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[1]: This was a suggestion from Noah Misch. I wasn't quite convinced
when he initially made it, but having studied the issue a lot more, I
now am. The CPU doesn't know how many processes have the memory
mapped into their address space.
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
Yes!
More processors is better, of course, but having anything at all to
test on would be an improvement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
Yes!
More processors is better, of course, but having anything at all to
test on would be an improvement.
OK, will check with India first, as it'll be easier for them to deploy.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-). Note that testing on a single-core
PPC will prove nothing.
regards, tom lane
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-). Note that testing on a single-core
PPC will prove nothing.
Yeah, I was just thinking about that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
sound direction.
In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
sound direction.
In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?
Good question - I have not tested.
One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:
if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
ReallyAcceptInvalidationMessages();
That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe. You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture. And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check regardless of the implementation.
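Spelled out as a sketch (same placeholder names as above;
ReallyAcceptInvalidationMessages() would be today's function body):
/* With 64-bit counters that never wrap, equal counters can only mean
 * there is truly nothing new to read, so this cheap unlocked test is
 * safe; any reset leaves nextMsgNum strictly behind maxMsgNum, so the
 * slow path still gets taken in that case. */
#define AcceptInvalidationMessages() \
    do { \
        if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum) \
            ReallyAcceptInvalidationMessages(); \
    } while (0)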
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jul 21, 2011, at 21:15, Robert Haas wrote:
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.
Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
"acquire barrier".
Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called "release barrier".
Now assume the following code fragment
global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;
If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
has "release barrier" semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is
SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();
But once that can happen, there's no reason that it couldn't also be
SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();
I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
best regards,
Florian Pflug
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those
stores.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
1. Machines with strong memory ordering. On this category of machines
(which include x86), the CPU basically does not perform loads or
stores out of order. On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program.
This is all correct, but...
The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization.
If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.
Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
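To spell out the example (standalone illustration, not PostgreSQL code):
#include <stdio.h>

volatile int flag0 = 0;
volatile int flag1 = 0;

/* Each of these runs in its own thread.  Without a full barrier at the
 * marked points, x86 may let the load of the other flag complete before
 * our own store is visible, so both threads can see 0 and both "win". */
void
thread0(void)
{
    flag0 = 1;
    /* full memory barrier (e.g. mfence) needed here */
    if (flag1 == 0)
        printf("thread 0 thinks it is alone\n");
}

void
thread1(void)
{
    flag1 = 1;
    /* full memory barrier (e.g. mfence) needed here */
    if (flag0 == 0)
        printf("thread 1 thinks it is alone\n");
}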
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
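The pattern in question, for reference (read_barrier_depends here is
just a placeholder macro, not something PostgreSQL currently provides):
/* Placeholder: would expand to a real barrier on Alpha, to nothing elsewhere. */
#define read_barrier_depends()  ((void) 0)

struct thing
{
    int     b;
};

struct thing *shared_p;             /* published by some other process */

static int
read_through_pointer(void)
{
    struct thing *p = shared_p;     /* load the pointer ... */

    read_barrier_depends();         /* ... needed on Alpha before using it ... */
    return p->b;                    /* ... then load what it points to */
}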
Alternatively, one might question how important it is to support the
Alpha these days...
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
On Jul 21, 2011, at 03:46, Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
Sounds sensible. There's one additional hazard though - you'll also
need the reads to be atomic. x86 guarantees that for loads of up to 32
bits (i386) respectively 64 bits (x64), but only for reads from properly aligned
addresses (4 bytes for 4-byte reads, 8 bytes for 8-byte reads).
I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...
If we widen the counter to 64-bit, reading it atomically becomes
a bit of a challenge on i386, but is doable also. From what I remember,
there are two options. One option is to use the 8-byte compare-and-exchange
operation, but it might be that only quite recent CPUs support that. The
other option seems to be to use floating-point instructions. I believe
the latter is what Intel's own Threading Building Blocks library does, but
I'd have to re-check to be sure. It might also be that, once you start
using floating-point instructions, you find that you actually do need
fencing instructions even on x86. Dunno if the weaker ordering affects only
SIMD instructions or all floating point stuff...
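For what it's worth, the compare-and-exchange route might look roughly
like this with gcc builtins (sketch only; uint64 is the usual PostgreSQL
typedef, and whether the target CPU actually has the 8-byte instruction
would still need to be checked at configure or run time):
/*
 * Atomically read a 64-bit value on i386 by compare-and-swapping it with
 * itself: the exchange never changes the stored value, but the old value
 * comes back as a single atomic snapshot.  Much more expensive than a
 * plain load, since it is a locked read-modify-write.
 */
static inline uint64
atomic_read_uint64(volatile uint64 *ptr)
{
    return __sync_val_compare_and_swap(ptr, (uint64) 0, (uint64) 0);
}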
best regards,
Florian Pflug
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jul21, 2011, at 21:15 , Robert Haas wrote:
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.
Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
"acquire barrier".
Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called "release barrier".
Now assume the following code fragment
global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;
If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
has "release barrier" semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is
SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();
But once that can happen, there's no reason that it couldn't also be
SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();
I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
Hmm. I'm not worried about that. AFAIK, only IA64 has such an
implementation, and our existing spinlock implementation doesn't use
it. If we were to add something like that in the future, we'd
presumably know that we were doing it, and would add the appropriate
memory barrier primitive at the same time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports <drkp@csail.mit.edu> wrote:
If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.
Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
In the case of sinval, what we need to do for SIGetDataEntries() is,
approximately, a bunch of loads, followed by a store to one of the
locations we loaded (which no one else can have written meanwhile).
So I think that's OK.
In SIInsertDataEntries(), what we need to do is, approximately, take a
lwlock, load from a location which can only be written while holding
the lwlock, do a bunch of stores, ending with a store to that first
location, and release the lwlock. I think that's OK, too.
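In sketch form, the write side looks roughly like this (simplified from
SIInsertDataEntries() in the patch; pg_compiler_barrier() is again a
made-up placeholder, and on a weak-ordering machine it would have to be
a real write fence):
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);

max = segP->maxMsgNum;          /* only writers ever change this */
for (i = 0; i < nthistime; i++)
    segP->buffer[(max + i) % MAXNUMMESSAGES] = data[i];

pg_compiler_barrier();          /* messages must be visible before the counter */
segP->maxMsgNum = max + nthistime;

LWLockRelease(SInvalWriteLock);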
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
Alternatively, one might question how important it is to support the
Alpha these days...
Well, currently, we do, so we probably don't want to drop support for
that without some careful thought. I searched the archive and found
someone trying to compile 8.3.something on Alpha just a few years ago,
so it's apparently not totally dead yet.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those
stores.
Now that is a potentially big problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those stores.
Now that is a potentially big problem.
Could we do something similar to the xxid hacks? That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word. You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
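For illustration, the shape of that idea might be something like the
following (made-up names, and the bookkeeping needed right around a
high-word bump is omitted):

typedef struct
{
    uint32      msgNumHigh;                 /* common high-order word */
    uint32      msgNumLow[MAX_BACKENDS];    /* per-counter low-order words */
} SplitCounters;

static inline uint64
notional_msgnum(SplitCounters *c, int i)
{
    return ((uint64) c->msgNumHigh << 32) | c->msgNumLow[i];
}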
regards, tom lane
On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those stores.
Now that is a potentially big problem.
Could we do something similar to the xxid hacks? That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word. You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
Well, the trouble is figuring out what the shape of that additional
overhead needs to look like. I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.
Test results, with the lazy vxid patch plus this patch, at 8 clients:
tps = 34028.144439 (including connections establishing)
tps = 34079.085935 (including connections establishing)
tps = 34125.295938 (including connections establishing)
And at 32 clients:
tps = 185521.605364 (including connections establishing)
tps = 188250.700451 (including connections establishing)
tps = 186077.847215 (including connections establishing)
And at 80 clients:
tps = 188568.886569 (including connections establishing)
tps = 191035.971512 (including connections establishing)
tps = 189363.019377 (including connections establishing)
Not quite as good as the unlocked version, but better than the
per-backend mutex, and a whole lot simpler.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
sinval-fastpath.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 4f446aa..7d60f31 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -499,11 +499,31 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
int max;
int n;
- LWLockAcquire(SInvalReadLock, LW_SHARED);
-
segP = shmInvalBuffer;
stateP = &segP->procState[MyBackendId - 1];
+ /*
+ * Before starting to take locks, do a quick, unlocked test to see whether
+ * there can possibly be anything to read. On a multiprocessor system,
+ * it's possible these loads could migrate backwards and occur before we
+ * actually enter this function, so we might miss a sinval message that
+ * was just added by some other processor. But they can't migrate
+ * backwards over a preceding lock acquisition, so it should be OK. If
+ * we haven't acquired a lock preventing against further relevant
+ * invalidations, any such occurrence is not much different than if the
+ * invalidation had arrived slightly later in the first place.
+ *
+ * It's important that we don't use the value read here to actually
+ * read from the queue. On machines with weak memory ordering, the
+ * maxMsgNum bump could become visible before the queue entries themselves.
+ * But the locking below will take care of that before rereading the
+ * value.
+ */
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
+
+ LWLockAcquire(SInvalReadLock, LW_SHARED);
+
/* Fetch current value of maxMsgNum using spinlock */
{
/* use volatile pointer to prevent code rearrangement */
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
+1 for doing this and moving on.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that. Maybe I should go ahead and do that.
+1 for doing this and moving on.
Yeah, I think I'll go ahead and commit something along these lines if
no one objects. We can always fine-tune it more if needed (but it
probably isn't).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
Ah, so it is.
It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that. Maybe I should go ahead and do that.
Seems like a nice simplification.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
Ah, so it is.
It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that. Maybe I should go ahead and do that.
Seems like a nice simplification.
On further reflection, I don't see that this helps: it just moves the
problem around. With resetState as a separate variable, nextMsgNum is
never changed by anyone other than the owner, so we can never have a
stale load. But if we overload nextMsgNum to also indicate whether
our state has been reset, then there's a race between when we load
nextMsgNum and when we load maxMsgNum (instead of code I posted
previously, which has a race between when we load resetState and when
we load maxMsgNum). Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence. (Of course, on x86, the fence could be optimized down to
a compiler barrier.) I guess the question is "should we worry about
that?".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On further reflection, I don't see that this helps: it just moves the
problem around. With resetState as a separate variable, nextMsgNum is
never changed by anyone other than the owner, so we can never have a
stale load. But if we overload nextMsgNum to also indicate whether
our state has been reset, then there's a race between when we load
nextMsgNum and when we load maxMsgNum (instead of code I posted
previously, which has a race between when we load resetState and when
we load maxMsgNum). Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence. (Of course, on x86, the fence could be optimized down to
a compiler barrier.) I guess the question is "should we worry about
that?".
Uh, yes. I've lost count of the number of times I've seen people hit
one-instruction-wide race condition windows, SIGSEGV crash based on
accessing only one byte past the valid data structure, etc etc.
The worst thing about this type of bug is that you can't reproduce the
failure when you try; doesn't mean your users won't see it.
regards, tom lane
Excerpts from Tom Lane's message of mar jul 26 13:43:08 -0400 2011:
Uh, yes. I've lost count of the number of times I've seen people hit
one-instruction-wide race condition windows, SIGSEGV crash based on
accessing only one byte past the valid data structure, etc etc.
I think you need a better locking protocol on that counter of yours.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 26, 2011 at 6:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence. (Of course, on x86, the fence could be optimized down to
a compiler barrier.) I guess the question is "should we worry about
that?".
Perhaps the answer lies in a different direction altogether?
Let me ask a few questions to stimulate a different solution
* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)
* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.
* Can we put the sinval info in a different place? e.g. inside each
lock partition.
* Why do we have a different mechanism for cache invalidation
internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
Why don't we have just one?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
Let me ask a few questions to stimulate a different solution
* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)
Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.
* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.
Not sure there's a way to meaningfully partition the queue. In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.
* Can we put the sinval info in a different place? e.g. inside each
lock partition.
I don't think you can relate sinval messages to particular locks, which
would make this infeasible. I think this bit is directly related to the
point above.
* Why do we have a different mechanism for cache invalidation
internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
Why don't we have just one?
Performance. Not sure there are other reasons as well; but just this
one is significant enough.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
Let me ask a few questions to stimulate a different solution
* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)
Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.
If there are no invalidations, there would be no signals. How would
zero signals decrease performance?
* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.
Not sure there's a way to meaningfully partition the queue. In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.
I agree there's no meaningful way to partition the queue, but we can
store the information in more than one place to reduce the contention
of people requesting it.
Both those ideas are still on the table.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 26, 2011 at 2:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
Let me ask a few questions to stimulate a different solution
* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)
Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.
If there are no invalidations, there would be no signals. How would
zero signals decrease performance?
It wouldn't, although it might be bad in the case where there are lots
of temp tables being created and dropped. I think the biggest problem
with signals is that they don't provide any meaningful synchronization
guarantees. When you send somebody a signal, you don't really know
how long it's going to take for them to receive it.
* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.
Not sure there's a way to meaningfully partition the queue. In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.
I agree there's no meaningful way to partition the queue, but we can
store the information in more than one place to reduce the contention
of people requesting it.
I thought about that. Basically, that saves you a factor of N in
contention on the read side (where N is the number of copies) and
costs you a factor of N on the write side (you have to update N
copies, taking a spinlock or lwlock for each). In the limit, you
could do one copy of the counter per backend.
I think, though, that a lock-free implementation using memory barriers
is going to end up being the only real solution. We might possibly
convince ourselves that we're OK with increasing the cost of
SIInsertDataEntries(), but any solution that involves replicating the
data is still based on doing at least some locking. And I am pretty
well convinced that even one spinlock acquisition in
SIInsertDataEntries() is too many.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It wouldn't, although it might be bad in the case where there are lots
of temp tables being created and dropped.
Do temp tables cause relcache invalidations?
That seems like something we'd want to change in itself.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.
If there are no invalidations, there would be no signals. How would
zero signals decrease performance?
But if there *is* an invalidation (which is not a negligible case),
it'd get significantly slower.
regards, tom lane
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.
Not sure there's a way to meaningfully partition the queue. In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.
I agree there's no meaningful way to partition the queue, but we can
store the information in more than one place to reduce the contention
of people requesting it.
I thought about that. Basically, that saves you a factor of N in
contention on the read side (where N is the number of copies) and
costs you a factor of N on the write side (you have to update N
copies, taking a spinlock or lwlock for each). In the limit, you
could do one copy of the counter per backend.
I think, though, that a lock-free implementation using memory barriers
is going to end up being the only real solution. We might possibly
convince ourselves that we're OK with increasing the cost of
SIInsertDataEntries(), but any solution that involves replicating the
data is still based on doing at least some locking. And I am pretty
well convinced that even one spinlock acquisition in
SIInsertDataEntries() is too many.
You might be right, but I think we have little knowledge of how some
memory barrier code you haven't written yet affects performance on
various architectures.
A spinlock per backend would cache very nicely, now you mention it. So
my money would be on the multiple copies.
It's not completely clear to me that updating N copies would be more
expensive. Accessing N low contention copies rather than 1
high-contention value might actually be a win.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@2ndQuadrant.com> writes:
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
+1 for doing this and moving on.
After some further reflection I believe this patch actually is pretty
safe, although Noah's explanation of why seems a bit confused. The key
points are that (1) each of the reads is atomic, and (2) we should not
be able to see a value that is older than our last acquisition/release
of a shared memory lock. These being granted, we will make a decision
that is correct, or at least was correct as of some time after that last
lock action. As stated in the patch comments, we are not required to
promptly assimilate sinval actions on objects we don't hold any lock on,
so this seems sufficient. In essence, we are relying on an assumed
prior visit to the lock manager to provide memory access synchronization
for the sinval no-work-to-do test.
The case Noah speculates about above amounts to supposing that this
reasoning fails to hold for the resetState value, and I don't see why
that would be true. Even if the above scenario did manage to happen,
it would not mean that we missed the reset, just that we hadn't noticed
it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
updates are a problem for us.
BTW, the patch really ought to add something to the comment around
line 100.
regards, tom lane
On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
Noah Misch <noah@2ndQuadrant.com> writes:
On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:
+ if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+ return 0;
Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
+1 for doing this and moving on.
After some further reflection I believe this patch actually is pretty
safe, although Noah's explanation of why seems a bit confused. The key
points are that (1) each of the reads is atomic, and (2) we should not
be able to see a value that is older than our last acquisition/release
of a shared memory lock. These being granted, we will make a decision
that is correct, or at least was correct as of some time after that last
lock action. As stated in the patch comments, we are not required to
promptly assimilate sinval actions on objects we don't hold any lock on,
so this seems sufficient. In essence, we are relying on an assumed
prior visit to the lock manager to provide memory access synchronization
for the sinval no-work-to-do test.
The case Noah speculates about above amounts to supposing that this
reasoning fails to hold for the resetState value, and I don't see why
that would be true. Even if the above scenario did manage to happen,
it would not mean that we missed the reset, just that we hadn't noticed
it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
updates are a problem for us.
Here's the way it can fail:
1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
= false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those
latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
2. Backend stalls for <long time>. Meanwhile, other backends issue
MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears
stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
segP->maxMsgNum = 500.
3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
500 from main memory. The patch's test finds no need to reset or process
invalidation messages.
That's the theoretical risk I wished to illustrate. Though this appears
possible on an abstract x86_64 system, I think it's unrealistic to suppose that
a dirty cache line could persist *throughout* the issuance of more than 10^9
invalidation messages on a concrete implementation.
A way to think about the problem is that our read of segP->maxMsgNum is too new.
If we had used a snapshot of values as of the most recent lock
acquisition/release, there would be no problem. It's the mix of a new-enough
stateP with an all-too-new segP that yields the anomaly.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 26, 2011 at 01:36:26PM -0400, Robert Haas wrote:
On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah@2ndquadrant.com> wrote:
This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)
It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
Ah, so it is.
It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that. Maybe I should go ahead and do that.
Seems like a nice simplification.
On further reflection, I don't see that this helps: it just moves the
problem around.
Alas, yes.
With resetState as a separate variable, nextMsgNum is
never changed by anyone other than the owner, so we can never have a
stale load.
It's also changed when SICleanupQueue() decides to wrap everyone. This could
probably be eliminated by using uint32 and letting overflow take care of wrap
implicitly, but ...
But if we overload nextMsgNum to also indicate whether
our state has been reset, then there's a race between when we load
nextMsgNum and when we load maxMsgNum (instead of code I posted
previously, which has a race between when we load resetState and when
we load maxMsgNum). Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence. (Of course, on x86, the fence could be optimized down to
a compiler barrier.)
No new ideas come to mind, here. We could migrate back toward your original
proposal of making the counter a non-wrapping uint64; Florian described some
nice techniques for doing atomic 64-bit reads on x86:
http://archives.postgresql.org/message-id/7C94C386-122F-4918-8624-A14352E8DBC5@phlo.org
I'm not personally acquainted with those approaches, but they sound promising.
Since the overlap between 32-bit installations and performance-sensitive
installations is all but gone, we could even just use a spinlock under 32-bit.
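As a sketch of one such technique (not necessarily the one described in that
message, and note it still costs a locked instruction): on i586 and later,
cmpxchg8b reads the full 8 bytes atomically, which gcc exposes via its __sync
builtins:

static inline uint64
atomic_read_uint64(volatile uint64 *ptr)
{
    /*
     * Compare-and-swap against a "new" value of zero: whether or not the
     * compare succeeds, the return value is an atomic snapshot of *ptr.
     */
    return __sync_val_compare_and_swap(ptr, (uint64) 0, (uint64) 0);
}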
I guess the question is "should we worry about that?".
I'd lean toward "no". I share Tom's unease about writing off a race condition
as being too improbable, but this is quite exceptionally improbable. On the
other hand, some of the fixes don't look too invasive.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@2ndQuadrant.com> writes:
On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
After some further reflection I believe this patch actually is pretty
safe, although Noah's explanation of why seems a bit confused.
Here's the way it can fail:
1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
= false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those
latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
2. Backend stalls for <long time>. Meanwhile, other backends issue
MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears
stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
segP->maxMsgNum = 500.
3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
500 from main memory. The patch's test finds no need to reset or process
invalidation messages.
[ squint... ] Hmm, you're right. The case where this would break
things is if (some of) the five unprocessed messages relate to some
object we've just locked. But the initial state you describe would be
valid right after obtaining such a lock.
That's the theoretical risk I wished to illustrate. Though this appears
possible on an abstract x86_64 system, I think it's unrealistic to suppose that
a dirty cache line could persist *throughout* the issuance of more than 10^9
invalidation messages on a concrete implementation.
Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison? If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all. You just need the process
to lose the CPU at the right time.
If we marked the pointers volatile, we could probably ensure that the
assembly code tests resetState last, but that's not sufficient to avoid
the stale-cache-line risk.
regards, tom lane
On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
Noah Misch <noah@2ndQuadrant.com> writes:
That's the theoretical risk I wished to illustrate. Though this appears
possible on an abstract x86_64 system, I think it's unrealistic to suppose that
a dirty cache line could persist *throughout* the issuance of more than 10^9
invalidation messages on a concrete implementation.
Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison? If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all. You just need the process
to lose the CPU at the right time.
True. If the compiler places the resetState load first, you could hit the
anomaly by "merely" setting a breakpoint on the next instruction, waiting for
exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
I think, though, we should either plug that _and_ the cache incoherency case or
worry about neither.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@2ndQuadrant.com> writes:
On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison? If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all. You just need the process
to lose the CPU at the right time.
True. If the compiler places the resetState load first, you could hit the
anomaly by "merely" setting a breakpoint on the next instruction, waiting for
exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
I think, though, we should either plug that _and_ the cache incoherency case or
worry about neither.
How do you figure that? The poor-assembly-code-order risk is both a lot
easier to fix and a lot higher probability. Admittedly, it's still way
way down there, but you only need a precisely-timed sleep, not a
precisely-timed sleep *and* a cache line that somehow remained stale.
regards, tom lane
On Tue, Jul 26, 2011 at 3:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
You might be right, but I think we have little knowledge of how some
memory barrier code you haven't written yet affects performance on
various architectures.
A spinlock per backend would cache very nicely, now you mention it. So
my money would be on the multiple copies.
Maybe so, but you can see from the numbers in my OP that the results
still leave something to be desired.
It's not completely clear to me that updating N copies would be more
expensive. Accessing N low contention copies rather than 1
high-contention value might actually be a win.
Yeah, I haven't tested that approach.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 3:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It wouldn't, although it might be bad in the case where there are lots
of temp tables being created and dropped.
Do temp tables cause relcache invalidations?
That seems like something we'd want to change in itself.
I agree. Unfortunately, I think it's a non-trivial fix.
I've also been wondering if we could avoid taking an
AccessExclusiveLock on a newly created (temporary?) table. It seems
like no one should be able to see it until commit, at which point we'd
be releasing the lock anyway.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
No new ideas come to mind, here.
OK, I have a new idea. :-)
1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIGetDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.
The LWLockRelease() in SIGetDataEntries() will be sufficient to force
all of the stateP->timeToPayAttention writes out to main memory, and
the read side is OK because we either just took a lock (which acted as
a fence) or else there's a race anyway. But unlike my previous
proposal, it doesn't involve *comparing* anything. We needn't worry
about whether we could read two different values that are through
great misfortune out of sync because we're only reading one value.
If, by chance, the value is set to true just after we set it to false,
that won't mess us up either: we'll still read all the messages after
acquiring the lock.
Now, there's some downside to this approach - it involves doing O(N)
work for each invalidation message we receive. But as long as it's
only O(N) stores and not O(N) lock acquisitions and releases, I think
that might be OK.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 06:04:16PM -0400, Tom Lane wrote:
Noah Misch <noah@2ndQuadrant.com> writes:
On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison? If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all. You just need the process
to lose the CPU at the right time.
True. If the compiler places the resetState load first, you could hit the
anomaly by "merely" setting a breakpoint on the next instruction, waiting for
exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
I think, though, we should either plug that _and_ the cache incoherency case or
worry about neither.
How do you figure that? The poor-assembly-code-order risk is both a lot
easier to fix and a lot higher probability. Admittedly, it's still way
way down there, but you only need a precisely-timed sleep, not a
precisely-timed sleep *and* a cache line that somehow remained stale.
I think both probabilities are too low to usefully distinguish. An sinval
wraparound takes a long time even in a deliberate test setup: almost 30 hours @
10k messages/sec. To get a backend to sleep that long, you'll probably need
something like SIGSTOP or a debugger attach. The sleep has to fall within the
space of no more than a few instructions. Then, you'd need to release the
process at the exact moment for it to observe wrapped equality. In other words,
you get one split-millisecond opportunity every 30 hours of process sleep time.
If your backends don't have multi-hour sleeps, it can't ever happen.
Even so, all the better if we settle on an approach that has neither hazard.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
No new ideas come to mind, here.
OK, I have a new idea. :-)
1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIGetDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.
The LWLockRelease() in SIGetDataEntries() will be sufficient to force
all of the stateP->timeToPayAttention writes out to main memory, and
the read side is OK because we either just took a lock (which acted as
a fence) or else there's a race anyway. But unlike my previous
proposal, it doesn't involve *comparing* anything. We needn't worry
about whether we could read two different values that are through
great misfortune out of sync because we're only reading one value.
If, by chance, the value is set to true just after we set it to false,
that won't mess us up either: we'll still read all the messages after
acquiring the lock.
There turned out to be a little bit of further subtlety to this, but
it seems to work. Patch attached.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
sinval-hasmessages.patch (application/octet-stream)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 4f446aa..1ecf4bd 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -143,6 +143,7 @@ typedef struct ProcState
int nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
+ bool hasMessages; /* backend has unread messages */
/*
* Backend only sends invalidations, never receives them. This only makes
@@ -248,6 +249,7 @@ CreateSharedInvalidationState(void)
shmInvalBuffer->procState[i].nextMsgNum = 0; /* meaningless */
shmInvalBuffer->procState[i].resetState = false;
shmInvalBuffer->procState[i].signaled = false;
+ shmInvalBuffer->procState[i].hasMessages = false;
shmInvalBuffer->procState[i].nextLXID = InvalidLocalTransactionId;
}
}
@@ -264,11 +266,9 @@ SharedInvalBackendInit(bool sendOnly)
SISeg *segP = shmInvalBuffer;
/*
- * This can run in parallel with read operations, and for that matter with
- * write operations; but not in parallel with additions and removals of
- * backends, nor in parallel with SICleanupQueue. It doesn't seem worth
- * having a third lock, so we choose to use SInvalWriteLock to serialize
- * additions/removals.
+ * This can run in parallel with read operations, but not with write
+ * operations, since SIInsertDataEntries relies on lastBackend to set
+ * hasMessages appropriately.
*/
LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
@@ -316,6 +316,7 @@ SharedInvalBackendInit(bool sendOnly)
stateP->nextMsgNum = segP->maxMsgNum;
stateP->resetState = false;
stateP->signaled = false;
+ stateP->hasMessages = false;
stateP->sendOnly = sendOnly;
LWLockRelease(SInvalWriteLock);
@@ -417,6 +418,7 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
int nthistime = Min(n, WRITE_QUANTUM);
int numMsgs;
int max;
+ int i;
n -= nthistime;
@@ -459,6 +461,19 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
SpinLockRelease(&vsegP->msgnumLock);
}
+ /*
+ * Now that the maxMsgNum change is globally visible, we give
+ * everyone a swift kick to make sure they read the newly added
+ * messages. Releasing SInvalWriteLock will enforce a full memory
+ * barrier, so these (unlocked) changes will be committed to memory
+ * before we exit the function.
+ */
+ for (i = 0; i < segP->lastBackend; i++)
+ {
+ ProcState *stateP = &segP->procState[i];
+ stateP->hasMessages = TRUE;
+ }
+
LWLockRelease(SInvalWriteLock);
}
}
@@ -499,11 +514,36 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
int max;
int n;
- LWLockAcquire(SInvalReadLock, LW_SHARED);
-
segP = shmInvalBuffer;
stateP = &segP->procState[MyBackendId - 1];
+ /*
+ * Before starting to take locks, do a quick, unlocked test to see whether
+ * there can possibly be anything to read. On a multiprocessor system,
+ * it's possible that this load could migrate backwards and occur before we
+ * actually enter this function, so we might miss a sinval message that
+ * was just added by some other processor. But they can't migrate
+ * backwards over a preceding lock acquisition, so it should be OK. If
+ * we haven't acquired a lock preventing against further relevant
+ * invalidations, any such occurrence is not much different than if the
+ * invalidation had arrived slightly later in the first place.
+ */
+ if (!stateP->hasMessages)
+ return 0;
+
+ LWLockAcquire(SInvalReadLock, LW_SHARED);
+
+ /*
+ * We must reset hasMessages before determining how many messages we're
+ * going to read. That way, if new messages arrive after we have
+ * determined how many we're reading, the flag will get reset and we'll
+ * notice those messages part-way through.
+ *
+ * Note that, if we don't end up reading all of the messages, we had
+ * better be certain to reset this flag before exiting!
+ */
+ stateP->hasMessages = FALSE;
+
/* Fetch current value of maxMsgNum using spinlock */
{
/* use volatile pointer to prevent code rearrangement */
@@ -544,10 +584,16 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
}
/*
- * Reset our "signaled" flag whenever we have caught up completely.
+ * If we have caught up completely, reset our "signaled" flag so that
+ * we'll get another signal if we fall behind again.
+ *
+ * If we haven't caught up completely, reset the hasMessages flag so that
+ * we see the remaining messages next time.
*/
if (stateP->nextMsgNum >= max)
stateP->signaled = false;
+ else
+ stateP->hasMessages = TRUE;
LWLockRelease(SInvalReadLock);
return n;
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIGetDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.
There turned out to be a little bit of further subtlety to this, but
it seems to work. Patch attached.
And?
It didn't sound to me like this could possibly be a performance win,
but I await some numbers ...
regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote:
There turned out to be a little bit of further subtlety to this,
but it seems to work. Patch attached.
Stylistic question: Why is stateP->hasMessages set to false in one
place and FALSE and TRUE in others? It seems like it would be less
confusing to be consistent.
A quick count of the usage of both forms in the PostgreSQL source
code shows the lowercase is used about 10 times more often:
kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
-or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
'\b(FALSE|TRUE)\b'
1670
kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
-or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
'\b(false|true)\b'
17349
-Kevin
On Wed, Jul 27, 2011 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIGetDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.
There turned out to be a little bit of further subtlety to this, but
it seems to work. Patch attached.
And?
It didn't sound to me like this could possibly be a performance win,
but I await some numbers ...
On master, with the patch, scale factor 100, pgbench -S -c $CLIENTS -j
$CLIENTS -T 300 results for different client counts, on a 32-core
machine with 128GB RAM, shared_buffers = 8GB:
01 tps = 4444.280459 (including connections establishing)
01 tps = 4438.365576 (including connections establishing)
01 tps = 4423.718466 (including connections establishing)
08 tps = 33187.827872 (including connections establishing)
08 tps = 33288.247330 (including connections establishing)
08 tps = 33304.307835 (including connections establishing)
32 tps = 178876.350559 (including connections establishing)
32 tps = 177293.082295 (including connections establishing)
32 tps = 175577.058885 (including connections establishing)
80 tps = 159202.449993 (including connections establishing)
80 tps = 151541.717088 (including connections establishing)
80 tps = 150454.658132 (including connections establishing)
Without the patch:
01 tps = 4447.660101 (including connections establishing)
01 tps = 4440.830713 (including connections establishing)
01 tps = 4411.610348 (including connections establishing)
08 tps = 33135.773476 (including connections establishing)
08 tps = 33365.387051 (including connections establishing)
08 tps = 33364.859705 (including connections establishing)
32 tps = 175834.280471 (including connections establishing)
32 tps = 176713.118271 (including connections establishing)
32 tps = 176830.687087 (including connections establishing)
80 tps = 135211.036587 (including connections establishing)
80 tps = 130642.264016 (including connections establishing)
80 tps = 133621.549513 (including connections establishing)
I'm fairly certain the results will be even more dramatic with the
lazy-vxid patch applied, since master still has to fight with the lock
manager on this workload. I haven't tested that yet, but there's not
much reason to suppose that the results will be dramatically different
from any of the other methods of reducing the sinval overhead that
I've benchmarked, many of which I've already posted about. See, for
example, the OP on this thread.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 09:57:10PM -0400, Robert Haas wrote:
On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah@2ndquadrant.com> wrote:
No new ideas come to mind, here.
OK, I have a new idea. :-)
1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIGetDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.
The LWLockRelease() in SIGetDataEntries() will be sufficient to force
all of the stateP->timeToPayAttention writes out to main memory, and
the read side is OK because we either just took a lock (which acted as
a fence) or else there's a race anyway. But unlike my previous
proposal, it doesn't involve *comparing* anything. We needn't worry
about whether we could read two different values that are through
great misfortune out of sync because we're only reading one value.
If, by chance, the value is set to true just after we set it to false,
that won't mess us up either: we'll still read all the messages after
acquiring the lock.
This approach would work if a spinlock release constrained the global stores
timeline. It makes a weaker guarantee: all stores preceding the lock release
in program order will precede it globally. Consequently, no processor will
observe the spinlock to be available without also observing all stores made by
the last holding processor before that processor released it. That guarantee
is not enough for this sequence of events:
Backend 0:
add a message for table "a"
store 5 => maxMsgNum
store true => timeToPayAttention
LWLockRelease(SInvalWriteLock)
<plenty of time passes>
Backend 2:
LOCK TABLE a;
load timeToPayAttention => true
Backend 1:
add a message for table "b"
store 6 => maxMsgNum
SpinLockRelease(&vsegP->msgnumLock);
store true => timeToPayAttention
Backend 2:
store false => timeToPayAttention
LWLockAcquire(SInvalReadLock, LW_SHARED)
load maxMsgNum => 5 [!]
Backend 1:
LWLockRelease(SInvalWriteLock);
Backend 2:
LOCK TABLE b;
load timeToPayAttention => false
<use b without processing updates>
The "SpinLockRelease(&vsegP->msgnumLock)" guarantees that any process
observing the backend 2 store of timeToPayAttention will also observe
maxMsgNum = 6. However, nothing constrains which timeToPayAttention store
will "win" in main memory. Backend 1 can observe neither update from backend
2, yet still have its store appear later than the backend 1 stores in the
global timeline.
Now, there's some downside to this approach - it involves doing O(N)
work for each invalidation message we receive. But as long as it's
only O(N) stores and not O(N) lock acquisitions and releases, I think
that might be OK.
I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables. If
that shows no statistically significant regression, then we're probably fine
here. I'm not sure what result to expect, honestly.
What did you think of making the message number a uint64, removing wraparound
handling, and retaining SISeg.msgnumLock for 32-bit only? You could isolate the
variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
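Something like this, purely as a sketch (written as an inline function rather
than a macro, and assuming that aligned 8-byte loads are atomic wherever
pointers are 8 bytes wide; WRITE_MSGNUM would be the mirror image):

static inline uint64
READ_MSGNUM(volatile SISeg *segP)
{
#if SIZEOF_VOID_P >= 8
    return segP->maxMsgNum;     /* assumed-atomic 8-byte load */
#else
    uint64      msgNum;

    SpinLockAcquire(&segP->msgnumLock);
    msgNum = segP->maxMsgNum;
    SpinLockRelease(&segP->msgnumLock);
    return msgNum;
#endif
}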
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah@2ndquadrant.com> wrote:
This approach would work if a spinlock release constrained the global stores
timeline. It makes a weaker guarantee: all stores preceding the lock release
in program order will precede it globally. Consequently, no processor will
observe the spinlock to be available without also observing all stores made by
the last holding processor before that processor released it. That guarantee
is not enough for this sequence of events:Backend 0:
add a message for table "a"
store 5 => maxMsgNum
store true => timeToPayAttention
LWLockRelease(SInvalWriteLock)
<plenty of time passes>
Backend 2:
LOCK TABLE a;
load timeToPayAttention => true
Backend 1:
add a message for table "b"
store 6 => maxMsgNum
SpinLockRelease(&vsegP->msgnumLock);
store true => timeToPayAttention
Backend 2:
store false => timeToPayAttention
LWLockAcquire(SInvalReadLock, LW_SHARED)
load maxMsgNum => 5 [!]
Eh, how can this possibly happen? You have to hold msgNumLock to
set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
to guarantee that we never read a stale value, what is?
I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables. If
that shows no statistically significant regression, then we're probably fine
here. I'm not sure what result to expect, honestly.
That's setting the bar pretty high. I don't mind doing the
experiment, but I'm not sure that's the case we should be optimizing
for.
What did you think of making the message number a uint64, removing wraparound
handling, and retaining SISeg.msgnumLock for 32-bit only? You could isolate the
variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
Well, what you really need to know is whether the platform has 8-byte
atomic stores, which doesn't seem trivial to figure out, plus you need
a memory fence. If that's the only method of fixing this problem we
can agree on, I'm willing to work on it, but an
architecture-independent fix would be nicer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 27, 2011 at 01:30:47PM -0400, Robert Haas wrote:
On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah@2ndquadrant.com> wrote:
[wrong objection]
Eh, how can this possibly happen? You have to hold msgNumLock to to
set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
to guarantee that we never read a stale value, what is?
Indeed, my analysis was all wrong.
I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables. If
that shows no statistically significant regression, then we're probably fine
here. I'm not sure what result to expect, honestly.
That's setting the bar pretty high. I don't mind doing the
experiment, but I'm not sure that's the case we should be optimizing
for.
Granted. How about 32 clients running the temporary table transaction, no idle
connections? Given the meager benefit of this patch compared to your previous
version, it would be hard to justify a notable performance drop in return.
What did you think of making the message number a uint64, removing wraparound
handling, and retaining SISeg.msgnumLock for 32-bit only? You could isolate the
variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
Well, what you really need to know is whether the platform has 8-byte
atomic stores, which doesn't seem trivial to figure out, plus you need
a memory fence. If that's the only method of fixing this problem we
can agree on, I'm willing to work on it, but an
architecture-independent fix would be nicer.
Fair enough.
Thanks,
nm
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 27, 2011 at 1:58 PM, Noah Misch <noah@2ndquadrant.com> wrote:
I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables. If
that shows no statistically significant regression, then we're probably fine
here. I'm not sure what result to expect, honestly.
That's setting the bar pretty high. I don't mind doing the
experiment, but I'm not sure that's the case we should be optimizing
for.
Granted. How about 32 clients running the temporary table transaction, no idle
connections? Given the meager benefit of this patch compared to your previous
version, it would be hard to justify a notable performance drop in return.
The reason the benefit is smaller is, I believe, because the previous
numbers were generated with the lazy vxid locks patch applied, and
these were generated against master. With the lock manager as a
bottleneck, the sinval stuff doesn't get hit quite as hard, so the
benefit is less. I can regenerate the numbers with the lazy vxid
patch applied; I suspect they'll be similar to what we saw before.
I'll also test out creating and dropping some tables.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
The reason the benefit is smaller is, I believe, because the previous
numbers were generated with the lazy vxid locks patch applied, and
these were generated against master. With the lock manager as a
bottleneck, the sinval stuff doesn't get hit quite as hard, so the
benefit is less. I can regenerate the numbers with the lazy vxid
patch applied; I suspect they'll be similar to what we saw before.
Yep. Here's with both lazy-vxids and sinval-hasmessages:
01 tps = 4470.133776 (including connections establishing)
01 tps = 4471.450650 (including connections establishing)
01 tps = 4490.833194 (including connections establishing)
32 tps = 191416.960956 (including connections establishing)
32 tps = 190653.742400 (including connections establishing)
32 tps = 191832.231458 (including connections establishing)
80 tps = 189348.509378 (including connections establishing)
80 tps = 191080.641878 (including connections establishing)
80 tps = 191366.728275 (including connections establishing)
And with just lazy vxids:
01 tps = 4458.667013 (including connections establishing)
01 tps = 4526.922638 (including connections establishing)
01 tps = 4480.415099 (including connections establishing)
32 tps = 193273.434028 (including connections establishing)
32 tps = 190661.279391 (including connections establishing)
32 tps = 189526.560031 (including connections establishing)
80 tps = 150572.020250 (including connections establishing)
80 tps = 118643.970622 (including connections establishing)
80 tps = 119211.643930 (including connections establishing)
Same select-only, scale-factor-100 pgbench test, same 32 core machine,
as I've been using for my other recent tests.
I'll also test out creating and dropping some tables.
Still need to work on this one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I'll also test out creating and dropping some tables.
Still need to work on this one.
And the results are in. I set up the following sophisticated test
script for pgbench:
CREATE TEMP TABLE foo (a int);
DROP TABLE foo;
And then did 5-minute test runs with varying numbers of clients on
Nate Boley's 32-core machine with (a) master, (b) master +
sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
+ sinval-hasmessages. I held my breath until the results started
popping out, then felt much better. In the table below, results with
L are with the lazy-vxid patch, S is with the sinval-hasmessages
patch, LS with both, and results with no letters are neither. The
numbers are the client count. Long story short, it seems that the
patch actually makes this test case faster at any client code, though
the improvement at 1 and 8 clients might be in the noise:
01L tps = 514.880290 (including connections establishing)
01L tps = 525.097199 (including connections establishing)
01L tps = 508.319588 (including connections establishing)
08L tps = 1834.259810 (including connections establishing)
08L tps = 1846.846089 (including connections establishing)
08L tps = 1841.402433 (including connections establishing)
32L tps = 1463.822936 (including connections establishing)
32L tps = 1481.169483 (including connections establishing)
32L tps = 1393.780335 (including connections establishing)
80L tps = 1192.768020 (including connections establishing)
80L tps = 1165.545010 (including connections establishing)
80L tps = 1169.776066 (including connections establishing)
01LS tps = 517.624068 (including connections establishing)
01LS tps = 524.507723 (including connections establishing)
01LS tps = 507.847622 (including connections establishing)
08LS tps = 1831.248178 (including connections establishing)
08LS tps = 1873.932133 (including connections establishing)
08LS tps = 1863.048113 (including connections establishing)
32LS tps = 1851.143407 (including connections establishing)
32LS tps = 1754.683356 (including connections establishing)
32LS tps = 1785.926527 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)
01 tps = 511.572832 (including connections establishing)
01 tps = 499.389527 (including connections establishing)
01 tps = 495.697080 (including connections establishing)
08 tps = 1832.762548 (including connections establishing)
08 tps = 1819.884564 (including connections establishing)
08 tps = 1835.608561 (including connections establishing)
32 tps = 1417.168790 (including connections establishing)
32 tps = 1447.478971 (including connections establishing)
32 tps = 1427.489879 (including connections establishing)
80 tps = 1154.272515 (including connections establishing)
80 tps = 1168.805881 (including connections establishing)
80 tps = 1173.971801 (including connections establishing)
01S tps = 519.860218 (including connections establishing)
01S tps = 510.759087 (including connections establishing)
01S tps = 517.159276 (including connections establishing)
08S tps = 1880.179600 (including connections establishing)
08S tps = 1829.693454 (including connections establishing)
08S tps = 1886.168401 (including connections establishing)
32S tps = 1809.950929 (including connections establishing)
32S tps = 1809.474070 (including connections establishing)
32S tps = 1798.620842 (including connections establishing)
80S tps = 1483.037788 (including connections establishing)
80S tps = 1481.059504 (including connections establishing)
80S tps = 1487.215748 (including connections establishing)
So, apparently, the extra work in SIInsertDataEntries() is more than
paid for by the speedup in SIGetDataEntries(). I'm guessing that at
high client counts you win because of reduced spinlock contention, and
at low client counts you still win a little bit because
SIGetDataEntries() is called multiple times per transaction, whereas
SIInsertDataEntries() is only called once. I could be all wet on the
reason, but at any rate the numbers look pretty good.
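To make the mechanism concrete, here's a rough sketch of the hasMessages
fast path as I understand it -- not the patch text itself; hasMessages is
the flag the patch is presumed to add, while segP, procState, lastBackend,
maxMsgNum, and MyBackendId are the existing sinvaladt.c names:

/*
 * In SIInsertDataEntries(), after adding entries and advancing maxMsgNum,
 * flag every backend so its next SIGetDataEntries() call knows to look.
 */
for (i = 0; i < segP->lastBackend; i++)
	segP->procState[i].hasMessages = true;

/* In SIGetDataEntries(), before touching any lock or shared counter: */
stateP = &segP->procState[MyBackendId - 1];
if (!stateP->hasMessages)
	return 0;				/* common case: nothing new for this backend */

/*
 * Clear the flag before reading maxMsgNum, not after.  If a writer adds a
 * message while we're copying entries, it re-sets the flag; at worst we
 * make one extra (empty) pass on the next call, but we can never miss a
 * message.
 */
stateP->hasMessages = false;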
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I'll also test out creating and dropping some tables.
Still need to work on this one.
And the results are in. I set up the following sophisticated test
script for pgbench:
CREATE TEMP TABLE foo (a int);
DROP TABLE foo;
And then did 5-minute test runs with varying numbers of clients on
Nate Boley's 32-core machine with (a) master, (b) master +
sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
+ sinval-hasmessages.
The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
vs. (b) master + lazy-vxid + [2]sinval-hasmessages. The only claimed benefit of
[2] over [1], as far as I can see, is invulnerability to the hazard described in
[3]. Before selecting [2] over [1], it would be instructive to know how much
performance we exchange for its benefit.
I did a bit of (relatively unrigorous) poking at this workload with oprofile,
and I never saw SIInsertDataEntries take more than 0.26% of CPU time. It's
perhaps too cheap relative to the workload's other costs to matter. That's good
enough for me.
[1]: http://archives.postgresql.org/message-id/CA+TgmoZc8Z_JTj44xvpWpXKQt2jGjB5YGCZ3T9u5-QZVdBmyCA@mail.gmail.com
[2]: http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xOX3aHuPYbmJT9+PKW6qubQzN7sukK3Dg@mail.gmail.com
[3]: http://archives.postgresql.org/message-id/20110727033537.GB18910@tornado.leadboat.com
80L tps = 1192.768020 (including connections establishing)
80L tps = 1165.545010 (including connections establishing)
80L tps = 1169.776066 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)
80 tps = 1154.272515 (including connections establishing)
80 tps = 1168.805881 (including connections establishing)
80 tps = 1173.971801 (including connections establishing)
80S tps = 1483.037788 (including connections establishing)
80S tps = 1481.059504 (including connections establishing)
80S tps = 1487.215748 (including connections establishing)
So, apparently, the extra work in SIInsertDataEntries() is more than
paid for by the speedup in SIGetDataEntries(). I'm guessing that at
high client counts you win because of reduced spinlock contention, and
at low client counts you still win a little bit because
SIGetDataEntries() is called multiple times per transaction, whereas
SIInsertDataEntries() is only called once. I could be all wet on the
reason, but at any rate the numbers look pretty good.
Interesting. I had hypothesized that at two clients per core, nearly every call
to SIGetDataEntries() would find work to do, making the patch a strict loss. A
bit of instrumented testing showed otherwise: at two clients per core,
sinval-hasmessages optimized away 93% of the calls. At fifty clients per core,
it still optimized away more than half of the calls.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch <noah@2ndquadrant.com> wrote:
The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
vs. (b) master + lazy-vxid + [2]sinval-hasmessages. The only claimed benefit of
[2] over [1], as far as I can see, is invulnerability to the hazard described in
[3]. Before selecting [2] over [1], it would be instructive to know how much
performance we exchange for its benefit.
I did a bit of (relatively unrigorous) poking at this workload with oprofile,
and I never saw SIInsertDataEntries take more than 0.26% of CPU time. It's
perhaps too cheap relative to the workload's other costs to matter. That's good
enough for me.
Me, too. There's another possible benefit, though: in
sinval-fastpath, everybody's got to read maxMsgNum every time through;
whereas in sinval-hasmessages, everyone's reading their own flag. I'm
not sure how much it costs to have memory that gets read by multiple
readers rather than just a single reader, but I think I've seen some
things that indicate it might not always be free.
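In cache-line terms the contrast is roughly the following (illustrative
fragments only, not code from either patch, and assuming each backend's
ProcState entry lands on its own cache line, which would take explicit
padding to guarantee):

/*
 * sinval-fastpath: every reader polls the same shared word, so the cache
 * line holding maxMsgNum bounces between the writer and all readers.
 */
if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
	return 0;

/*
 * sinval-hasmessages: each reader polls only its own ProcState entry; the
 * writer dirties one line per backend, but each line has a single reader.
 */
if (!segP->procState[MyBackendId - 1].hasMessages)
	return 0;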
Interesting. I had hypothesized that at two clients per core, nearly every call
to SIGetDataEntries() would find work to do, making the patch a strict loss. A
bit of instrumented testing showed otherwise: at two clients per core,
sinval-hasmessages optimized away 93% of the calls. At fifty clients per core,
it still optimized away more than half of the calls.
Wow. That's better than I expected. Great idea for a test, too.
Unless someone has another concern here, I think we've got this one nailed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company