sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum
Good day,
Investigating some performance issues of a client, our engineers found
msgnumLock to be contended.
Looking closer, it is obvious it is not needed at all: it is used only as a
memory barrier. This is even stated in the comment at the start of the file:
* We deal with that by having a spinlock that readers must take for just
* long enough to read maxMsgNum, while writers take it for just long enough
* to write maxMsgNum. (The exact rule is that you need the spinlock to
* read maxMsgNum if you are not holding SInvalWriteLock, and you need the
* spinlock to write maxMsgNum unless you are holding both locks.)
*
* Note: since maxMsgNum is an int and hence presumably atomically readable/
* writable, the spinlock might seem unnecessary. The reason it is needed
* is to provide a memory barrier: we need to be sure that messages written
* to the array are actually there before maxMsgNum is increased, and that
* readers will see that data after fetching maxMsgNum.
So we changed maxMsgNum to be pg_atomic_uint32 and added appropriate memory
barriers:
- pg_write_barrier() before the write to maxMsgNum (when only SInvalWriteLock
is held)
- pg_read_barrier() after the read of maxMsgNum (when only SInvalReadLock is
held)

This improved the situation for our client.

Note: pg_(write|read)_barrier() is chosen instead of
pg_atomic_(read|write)_membarrier_u32() because it is certainly cheaper, at
least on x86_64, where it translates to just a compiler barrier (an empty
asm statement).
In particular, pg_atomic_read_membarrier_u32() is currently implemented as a
write operation, which is not good for a contended place.
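
To illustrate the pairing outside PostgreSQL's headers, here is the same
publish/consume pattern as a minimal standalone C11 sketch (not the actual
code; the release store stands in for pg_write_barrier() followed by
pg_atomic_write_u32(), and the acquire load stands in for
pg_atomic_read_u32() followed by pg_read_barrier()):

#include <stdatomic.h>
#include <stdint.h>

#define NBUF 4096

static int buffer[NBUF];            /* message slots */
static _Atomic uint32_t maxMsgNum;  /* next message number to assign */

/* Writer: fill the slot first, then publish the new counter. */
void publish(int msg)
{
    uint32_t max = atomic_load_explicit(&maxMsgNum, memory_order_relaxed);

    buffer[max % NBUF] = msg;
    /* release ordering: the slot contents become visible no later than
     * the new counter value does */
    atomic_store_explicit(&maxMsgNum, max + 1, memory_order_release);
}

/* Reader: fetch the counter first, then read slots below it. */
int consume(uint32_t next)
{
    /* acquire ordering: slot reads cannot be hoisted above this load */
    uint32_t max = atomic_load_explicit(&maxMsgNum, memory_order_acquire);

    return (next != max) ? buffer[next % NBUF] : -1;
}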
-----
regards
Yura Sokolov aka funny-falcon
Attachments:
v0-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch (text/x-patch)
From 351b3ecb5d085559beb13ffd6dd4416dab4b4a84 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v0] sinvaladt.c: use atomic operations on maxMsgNum
msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.
Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.
---
src/backend/storage/ipc/sinvaladt.c | 56 ++++++++++++-----------------
1 file changed, 23 insertions(+), 33 deletions(-)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 2da91738c32..7d4912c6419 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -86,19 +86,10 @@
* has no need to touch anyone's ProcState, except in the infrequent cases
* when SICleanupQueue is needed. The only point of overlap is that
* the writer wants to change maxMsgNum while readers need to read it.
- * We deal with that by having a spinlock that readers must take for just
- * long enough to read maxMsgNum, while writers take it for just long enough
- * to write maxMsgNum. (The exact rule is that you need the spinlock to
- * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
- * spinlock to write maxMsgNum unless you are holding both locks.)
- *
- * Note: since maxMsgNum is an int and hence presumably atomically readable/
- * writable, the spinlock might seem unnecessary. The reason it is needed
- * is to provide a memory barrier: we need to be sure that messages written
- * to the array are actually there before maxMsgNum is increased, and that
- * readers will see that data after fetching maxMsgNum. Multiprocessors
- * that have weak memory-ordering guarantees can fail without the memory
- * barrier instructions that are included in the spinlock sequences.
+ * We deal with that by using atomic operations with proper memory barriers.
+ * (The exact rule is that you need the read barrier after atomic read
+ * maxMsgNum if you are not holding SInvalWriteLock, and you need the
+ * write barrier before write maxMsgNum unless you are holding both locks.)
*/
@@ -168,11 +159,9 @@ typedef struct SISeg
* General state information
*/
int minMsgNum; /* oldest message still needed */
- int maxMsgNum; /* next message number to be assigned */
+ pg_atomic_uint32 maxMsgNum; /* next message number to be assigned */
int nextThreshold; /* # of messages to call SICleanupQueue */
- slock_t msgnumLock; /* spinlock protecting maxMsgNum */
-
/*
* Circular buffer holding shared-inval messages
*/
@@ -243,9 +232,8 @@ SharedInvalShmemInit(void)
/* Clear message counters, save size of procState array, init spinlock */
shmInvalBuffer->minMsgNum = 0;
- shmInvalBuffer->maxMsgNum = 0;
+ pg_atomic_init_u32(&shmInvalBuffer->maxMsgNum, 0);
shmInvalBuffer->nextThreshold = CLEANUP_MIN;
- SpinLockInit(&shmInvalBuffer->msgnumLock);
/* The buffer[] array is initially all unused, so we need not fill it */
@@ -303,7 +291,7 @@ SharedInvalBackendInit(bool sendOnly)
/* mark myself active, with all extant messages already read */
stateP->procPid = MyProcPid;
- stateP->nextMsgNum = segP->maxMsgNum;
+ stateP->nextMsgNum = pg_atomic_read_u32(&segP->maxMsgNum);
stateP->resetState = false;
stateP->signaled = false;
stateP->hasMessages = false;
@@ -399,7 +387,7 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
*/
for (;;)
{
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs + nthistime > MAXNUMMESSAGES ||
numMsgs >= segP->nextThreshold)
SICleanupQueue(true, nthistime);
@@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}
- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /* Update current value of maxMsgNum using atomic with memory barrier */
+ pg_write_barrier();
+ pg_atomic_write_u32(&segP->maxMsgNum, max);
/*
* Now that the maxMsgNum change is globally visible, we give everyone
@@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;
- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /* Fetch current value of maxMsgNum using atomic with memory barrier */
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();
if (stateP->resetState)
{
@@ -595,7 +581,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* backends here it is possible for them to keep sending messages without
* a problem even when they are the only active backend.
*/
- min = segP->maxMsgNum;
+ min = pg_atomic_read_u32(&segP->maxMsgNum);
minsig = min - SIG_THRESHOLD;
lowbound = min - MAXNUMMESSAGES + minFree;
@@ -640,8 +626,12 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
*/
if (min >= MSGNUMWRAPAROUND)
{
- segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ /*
+ * we don't need memory barrier here, but using sub_fetch is just
+ * simpler than read_u32+write_u32 pair, and this place is not
+ * contented.
+ */
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}
@@ -650,7 +640,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* Determine how many messages are still in the queue, and set the
* threshold at which we should repeat SICleanupQueue().
*/
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs < CLEANUP_MIN)
segP->nextThreshold = CLEANUP_MIN;
else
--
2.43.0
On 03/02/2025 13:05, Yura Sokolov wrote:
Investigating some performance issues of a client, our engineers found
msgnumLock to be contended.

Looking closer, it is obvious it is not needed at all: it is used only as a
memory barrier. This is even stated in the comment at the start of the file:

* We deal with that by having a spinlock that readers must take for just
* long enough to read maxMsgNum, while writers take it for just long enough
* to write maxMsgNum. (The exact rule is that you need the spinlock to
* read maxMsgNum if you are not holding SInvalWriteLock, and you need the
* spinlock to write maxMsgNum unless you are holding both locks.)
*
* Note: since maxMsgNum is an int and hence presumably atomically readable/
* writable, the spinlock might seem unnecessary. The reason it is needed
* is to provide a memory barrier: we need to be sure that messages written
* to the array are actually there before maxMsgNum is increased, and that
* readers will see that data after fetching maxMsgNum.

So we changed maxMsgNum to be pg_atomic_uint32 and added appropriate memory
barriers:
- pg_write_barrier() before the write to maxMsgNum (when only SInvalWriteLock
is held)
- pg_read_barrier() after the read of maxMsgNum (when only SInvalReadLock is
held)

This improved the situation for our client.

Note: pg_(write|read)_barrier() is chosen instead of
pg_atomic_(read|write)_membarrier_u32() because it is certainly cheaper, at
least on x86_64, where it translates to just a compiler barrier (an empty
asm statement).
In particular, pg_atomic_read_membarrier_u32() is currently implemented as a
write operation, which is not good for a contended place.
Makes sense.
@@ -640,8 +626,12 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
*/
if (min >= MSGNUMWRAPAROUND)
{
- segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ /*
+ * we don't need memory barrier here, but using sub_fetch is just
+ * simpler than read_u32+write_u32 pair, and this place is not
+ * contented.
+ */
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}
Did you lose the 'segP->minMsgNum -= MSGNUMWRAPAROUND;' here? Do we have
any tests for the wraparound?
Now that maxMsgNum is unsigned, should we switch to uint32 for minMsgNum
and nextThreshold for consistency? They still don't need to be atomic
IIRC, they're protected by SInvalReadLock and SInvalWriteLock.
I kind of wonder why we need that MSGNUMWRAPAROUND limit at all; can't
we just let the integer wrap around naturally? (after switching to
unsigned, that is). That doesn't need to be part of this patch though,
it can be done separately, if it's worthwhile...
--
Heikki Linnakangas
Neon (https://neon.tech)
03.02.2025 19:49, Heikki Linnakangas wrote:
On 03/02/2025 13:05, Yura Sokolov wrote:
Investigating some performance issues of a client, our engineers found
msgnumLock to be contended.

Looking closer, it is obvious it is not needed at all: it is used only as a
memory barrier. This is even stated in the comment at the start of the file:

* We deal with that by having a spinlock that readers must take for just
* long enough to read maxMsgNum, while writers take it for just long enough
* to write maxMsgNum. (The exact rule is that you need the spinlock to
* read maxMsgNum if you are not holding SInvalWriteLock, and you need the
* spinlock to write maxMsgNum unless you are holding both locks.)
*
* Note: since maxMsgNum is an int and hence presumably atomically readable/
* writable, the spinlock might seem unnecessary. The reason it is needed
* is to provide a memory barrier: we need to be sure that messages written
* to the array are actually there before maxMsgNum is increased, and that
* readers will see that data after fetching maxMsgNum.

So we changed maxMsgNum to be pg_atomic_uint32 and added appropriate memory
barriers:
- pg_write_barrier() before the write to maxMsgNum (when only SInvalWriteLock
is held)
- pg_read_barrier() after the read of maxMsgNum (when only SInvalReadLock is
held)

This improved the situation for our client.

Note: pg_(write|read)_barrier() is chosen instead of
pg_atomic_(read|write)_membarrier_u32() because it is certainly cheaper, at
least on x86_64, where it translates to just a compiler barrier (an empty
asm statement).
In particular, pg_atomic_read_membarrier_u32() is currently implemented as a
write operation, which is not good for a contended place.

Makes sense.
@@ -640,8 +626,12 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
*/
if (min >= MSGNUMWRAPAROUND)
{
- segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ /*
+ * we don't need memory barrier here, but using sub_fetch is just
+ * simpler than read_u32+write_u32 pair, and this place is not
+ * contented.
+ */
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}

Did you lose the 'segP->minMsgNum -= MSGNUMWRAPAROUND;' here? Do we have
any tests for the wraparound?
Oops, yes, thanks. I didn't miss it in the private version, but lost it
during cherry-picking on top of 'master'. Fixed.
Also removed the comment above sub_fetch.
Now that maxMsgNum is unsigned, should we switch to uint32 for minMsgNum
and nextThreshold for consistency? They still don't need to be atomic
IIRC, they're protected by SInvalReadLock and SInvalWriteLock.
Ok, I did. And a couple of local variables as well.
There are signed arguments of SIInsertDataEntries, SIGetDataEntries, and
SICleanupQueue. Should they be changed to unsigned?
I kind of wonder why we need that MSGNUMWRAPAROUND limit at all; can't
we just let the integer wrap around naturally? (after switching to
unsigned, that is). That doesn't need to be part of this patch though,
it can be done separately, if it's worthwhile...
There will still be a need to handle wraparound, but in comparisons then.
Overall, the code will not be simpler, I think. Are there any other benefits
from its removal?
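
To sketch what "in comparisons" would mean (a minimal illustration, assuming
message numbers become plain uint32 counters that wrap naturally):

#include <stdbool.h>
#include <stdint.h>

/* "a is older than b" under modular arithmetic.  This stays correct as
 * long as the two counters never drift more than 2^31 apart, which the
 * MAXNUMMESSAGES queue bound guarantees by a very wide margin. */
static inline bool msgnum_lt(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}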
-------
regards,
Yura Sokolov aka funny-falcon
Attachments:
v1-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch (text/x-patch)
From 37cfdbe3e09aee16d68ee001ae3e8501327cd24a Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v1] sinvaladt.c: use atomic operations on maxMsgNum
msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.
Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.
---
src/backend/storage/ipc/sinvaladt.c | 74 ++++++++++++-----------------
1 file changed, 30 insertions(+), 44 deletions(-)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 2da91738c32..c3b89afce9f 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -86,19 +86,10 @@
* has no need to touch anyone's ProcState, except in the infrequent cases
* when SICleanupQueue is needed. The only point of overlap is that
* the writer wants to change maxMsgNum while readers need to read it.
- * We deal with that by having a spinlock that readers must take for just
- * long enough to read maxMsgNum, while writers take it for just long enough
- * to write maxMsgNum. (The exact rule is that you need the spinlock to
- * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
- * spinlock to write maxMsgNum unless you are holding both locks.)
- *
- * Note: since maxMsgNum is an int and hence presumably atomically readable/
- * writable, the spinlock might seem unnecessary. The reason it is needed
- * is to provide a memory barrier: we need to be sure that messages written
- * to the array are actually there before maxMsgNum is increased, and that
- * readers will see that data after fetching maxMsgNum. Multiprocessors
- * that have weak memory-ordering guarantees can fail without the memory
- * barrier instructions that are included in the spinlock sequences.
+ * We deal with that by using atomic operations with proper memory barriers.
+ * (The exact rule is that you need the read barrier after atomic read
+ * maxMsgNum if you are not holding SInvalWriteLock, and you need the
+ * write barrier before write maxMsgNum unless you are holding both locks.)
*/
@@ -139,7 +130,7 @@ typedef struct ProcState
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
- int nextMsgNum; /* next message number to read */
+ uint32 nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
bool hasMessages; /* backend has unread messages */
@@ -167,11 +158,9 @@ typedef struct SISeg
/*
* General state information
*/
- int minMsgNum; /* oldest message still needed */
- int maxMsgNum; /* next message number to be assigned */
- int nextThreshold; /* # of messages to call SICleanupQueue */
-
- slock_t msgnumLock; /* spinlock protecting maxMsgNum */
+ uint32 minMsgNum; /* oldest message still needed */
+ pg_atomic_uint32 maxMsgNum; /* next message number to be assigned */
+ uint32 nextThreshold; /* # of messages to call SICleanupQueue */
/*
* Circular buffer holding shared-inval messages
@@ -243,9 +232,8 @@ SharedInvalShmemInit(void)
/* Clear message counters, save size of procState array, init spinlock */
shmInvalBuffer->minMsgNum = 0;
- shmInvalBuffer->maxMsgNum = 0;
+ pg_atomic_init_u32(&shmInvalBuffer->maxMsgNum, 0);
shmInvalBuffer->nextThreshold = CLEANUP_MIN;
- SpinLockInit(&shmInvalBuffer->msgnumLock);
/* The buffer[] array is initially all unused, so we need not fill it */
@@ -303,7 +291,7 @@ SharedInvalBackendInit(bool sendOnly)
/* mark myself active, with all extant messages already read */
stateP->procPid = MyProcPid;
- stateP->nextMsgNum = segP->maxMsgNum;
+ stateP->nextMsgNum = pg_atomic_read_u32(&segP->maxMsgNum);
stateP->resetState = false;
stateP->signaled = false;
stateP->hasMessages = false;
@@ -382,8 +370,8 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
while (n > 0)
{
int nthistime = Min(n, WRITE_QUANTUM);
- int numMsgs;
- int max;
+ uint32 numMsgs;
+ uint32 max;
int i;
n -= nthistime;
@@ -399,7 +387,7 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
*/
for (;;)
{
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs + nthistime > MAXNUMMESSAGES ||
numMsgs >= segP->nextThreshold)
SICleanupQueue(true, nthistime);
@@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}
- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /* Update current value of maxMsgNum using atomic with memory barrier */
+ pg_write_barrier();
+ pg_atomic_write_u32(&segP->maxMsgNum, max);
/*
* Now that the maxMsgNum change is globally visible, we give everyone
@@ -473,7 +460,7 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
SISeg *segP;
ProcState *stateP;
- int max;
+ uint32 max;
int n;
segP = shmInvalBuffer;
@@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;
- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /* Fetch current value of maxMsgNum using atomic with memory barrier */
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();
if (stateP->resetState)
{
@@ -576,11 +562,11 @@ void
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
- int min,
+ uint32 min,
minsig,
lowbound,
- numMsgs,
- i;
+ numMsgs;
+ int i;
ProcState *needSig = NULL;
/* Lock out all writers and readers */
@@ -595,14 +581,14 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* backends here it is possible for them to keep sending messages without
* a problem even when they are the only active backend.
*/
- min = segP->maxMsgNum;
- minsig = min - SIG_THRESHOLD;
- lowbound = min - MAXNUMMESSAGES + minFree;
+ min = pg_atomic_read_u32(&segP->maxMsgNum);
+ minsig = min - Min(min, SIG_THRESHOLD);
+ lowbound = min - Min(min, MAXNUMMESSAGES - minFree);
for (i = 0; i < segP->numProcs; i++)
{
ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
- int n = stateP->nextMsgNum;
+ uint32 n = stateP->nextMsgNum;
/* Ignore if already in reset state */
Assert(stateP->procPid != 0);
@@ -641,7 +627,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
if (min >= MSGNUMWRAPAROUND)
{
segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}
@@ -650,7 +636,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* Determine how many messages are still in the queue, and set the
* threshold at which we should repeat SICleanupQueue().
*/
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs < CLEANUP_MIN)
segP->nextThreshold = CLEANUP_MIN;
else
--
2.43.0
Just rebased the patch.
-------
regards
Yura Sokolov aka funny-falcon
Attachments:
v2-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch (text/x-patch)
From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.
Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.
---
src/backend/storage/ipc/sinvaladt.c | 74 ++++++++++++-----------------
1 file changed, 30 insertions(+), 44 deletions(-)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 2da91738c32..c3b89afce9f 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -86,19 +86,10 @@
* has no need to touch anyone's ProcState, except in the infrequent cases
* when SICleanupQueue is needed. The only point of overlap is that
* the writer wants to change maxMsgNum while readers need to read it.
- * We deal with that by having a spinlock that readers must take for just
- * long enough to read maxMsgNum, while writers take it for just long enough
- * to write maxMsgNum. (The exact rule is that you need the spinlock to
- * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
- * spinlock to write maxMsgNum unless you are holding both locks.)
- *
- * Note: since maxMsgNum is an int and hence presumably atomically readable/
- * writable, the spinlock might seem unnecessary. The reason it is needed
- * is to provide a memory barrier: we need to be sure that messages written
- * to the array are actually there before maxMsgNum is increased, and that
- * readers will see that data after fetching maxMsgNum. Multiprocessors
- * that have weak memory-ordering guarantees can fail without the memory
- * barrier instructions that are included in the spinlock sequences.
+ * We deal with that by using atomic operations with proper memory barriers.
+ * (The exact rule is that you need the read barrier after atomic read
+ * maxMsgNum if you are not holding SInvalWriteLock, and you need the
+ * write barrier before write maxMsgNum unless you are holding both locks.)
*/
@@ -139,7 +130,7 @@ typedef struct ProcState
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
- int nextMsgNum; /* next message number to read */
+ uint32 nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
bool hasMessages; /* backend has unread messages */
@@ -167,11 +158,9 @@ typedef struct SISeg
/*
* General state information
*/
- int minMsgNum; /* oldest message still needed */
- int maxMsgNum; /* next message number to be assigned */
- int nextThreshold; /* # of messages to call SICleanupQueue */
-
- slock_t msgnumLock; /* spinlock protecting maxMsgNum */
+ uint32 minMsgNum; /* oldest message still needed */
+ pg_atomic_uint32 maxMsgNum; /* next message number to be assigned */
+ uint32 nextThreshold; /* # of messages to call SICleanupQueue */
/*
* Circular buffer holding shared-inval messages
@@ -243,9 +232,8 @@ SharedInvalShmemInit(void)
/* Clear message counters, save size of procState array, init spinlock */
shmInvalBuffer->minMsgNum = 0;
- shmInvalBuffer->maxMsgNum = 0;
+ pg_atomic_init_u32(&shmInvalBuffer->maxMsgNum, 0);
shmInvalBuffer->nextThreshold = CLEANUP_MIN;
- SpinLockInit(&shmInvalBuffer->msgnumLock);
/* The buffer[] array is initially all unused, so we need not fill it */
@@ -303,7 +291,7 @@ SharedInvalBackendInit(bool sendOnly)
/* mark myself active, with all extant messages already read */
stateP->procPid = MyProcPid;
- stateP->nextMsgNum = segP->maxMsgNum;
+ stateP->nextMsgNum = pg_atomic_read_u32(&segP->maxMsgNum);
stateP->resetState = false;
stateP->signaled = false;
stateP->hasMessages = false;
@@ -382,8 +370,8 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
while (n > 0)
{
int nthistime = Min(n, WRITE_QUANTUM);
- int numMsgs;
- int max;
+ uint32 numMsgs;
+ uint32 max;
int i;
n -= nthistime;
@@ -399,7 +387,7 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
*/
for (;;)
{
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs + nthistime > MAXNUMMESSAGES ||
numMsgs >= segP->nextThreshold)
SICleanupQueue(true, nthistime);
@@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}
- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /* Update current value of maxMsgNum using atomic with memory barrier */
+ pg_write_barrier();
+ pg_atomic_write_u32(&segP->maxMsgNum, max);
/*
* Now that the maxMsgNum change is globally visible, we give everyone
@@ -473,7 +460,7 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
SISeg *segP;
ProcState *stateP;
- int max;
+ uint32 max;
int n;
segP = shmInvalBuffer;
@@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;
- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /* Fetch current value of maxMsgNum using atomic with memory barrier */
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();
if (stateP->resetState)
{
@@ -576,11 +562,11 @@ void
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
- int min,
+ uint32 min,
minsig,
lowbound,
- numMsgs,
- i;
+ numMsgs;
+ int i;
ProcState *needSig = NULL;
/* Lock out all writers and readers */
@@ -595,14 +581,14 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* backends here it is possible for them to keep sending messages without
* a problem even when they are the only active backend.
*/
- min = segP->maxMsgNum;
- minsig = min - SIG_THRESHOLD;
- lowbound = min - MAXNUMMESSAGES + minFree;
+ min = pg_atomic_read_u32(&segP->maxMsgNum);
+ minsig = min - Min(min, SIG_THRESHOLD);
+ lowbound = min - Min(min, MAXNUMMESSAGES - minFree);
for (i = 0; i < segP->numProcs; i++)
{
ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
- int n = stateP->nextMsgNum;
+ uint32 n = stateP->nextMsgNum;
/* Ignore if already in reset state */
Assert(stateP->procPid != 0);
@@ -641,7 +627,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
if (min >= MSGNUMWRAPAROUND)
{
segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}
@@ -650,7 +636,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* Determine how many messages are still in the queue, and set the
* threshold at which we should repeat SICleanupQueue().
*/
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs < CLEANUP_MIN)
segP->nextThreshold = CLEANUP_MIN;
else
--
2.43.0
Hi,
On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum

msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.

Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.
Is it actually true that full barriers aren't required? I think we might
actually rely on a stronger ordering.
@@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;

- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /* Fetch current value of maxMsgNum using atomic with memory barrier */
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();

if (stateP->resetState)
{
/*
* Force reset. We can say we have dealt with any messages added
* since the reset, as well; and that means we should clear the
* signaled flag, too.
*/
stateP->nextMsgNum = max;
stateP->resetState = false;
stateP->signaled = false;
LWLockRelease(SInvalReadLock);
return -1;
}
vs
@@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}

- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /* Update current value of maxMsgNum using atomic with memory barrier */
+ pg_write_barrier();
+ pg_atomic_write_u32(&segP->maxMsgNum, max);

/*
* Now that the maxMsgNum change is globally visible, we give everyone
* a swift kick to make sure they read the newly added messages.
* Releasing SInvalWriteLock will enforce a full memory barrier, so
* these (unlocked) changes will be committed to memory before we exit
* the function.
*/
for (i = 0; i < segP->numProcs; i++)
{
ProcState *stateP = &segP->procState[segP->pgprocnos[i]];

stateP->hasMessages = true;
}
On a loosely ordered architecture, the hasMessages=false in SIGetDataEntries()
could be reordered with the read of maxMsgNum. Which, afaict, would lead to
missing messages. That's not prevented by the pg_write_barrier() in
SIInsertDataEntries(). I think there may be other similar dangers.
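
To spell the interleaving out, here is the pattern as a standalone C11
sketch (not the actual code; atomic_thread_fence(memory_order_seq_cst)
corresponds to pg_memory_barrier()):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint32_t maxMsgNum;
static _Atomic bool hasMessages;

/* Writer, as in SIInsertDataEntries(): publish the counter, set the flag. */
void writer_kick(uint32_t newmax)
{
    atomic_store_explicit(&maxMsgNum, newmax, memory_order_release);
    atomic_store_explicit(&hasMessages, true, memory_order_relaxed);
}

/* Reader, as in SIGetDataEntries(): clear the flag, then fetch the counter. */
uint32_t reader_fetch(void)
{
    atomic_store_explicit(&hasMessages, false, memory_order_relaxed);
    /* Without this store-load fence, the CPU may perform the load of
     * maxMsgNum before the store of hasMessages = false.  A writer running
     * in between publishes a new maxMsgNum and sets the flag, the reader's
     * stale "false" then overwrites it, and since the reader also saw the
     * old maxMsgNum, the message is never noticed. */
    atomic_thread_fence(memory_order_seq_cst);  /* i.e. pg_memory_barrier() */
    return atomic_load_explicit(&maxMsgNum, memory_order_acquire);
}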
This could be solved by adding full memory barriers in a few places. But:
I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.
Have you verified that this actually addresses the performance issue?
Greetings,
Andres Freund
Hi, Andres
21.03.2025 19:33, Andres Freund wrote:
Hi,
On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum

msgnumLock spinlock could be highly contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.

Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
no full barrier semantic is required, and explicit read/write barriers
are cheaper at least on x86_64.Is it actually true that full barriers aren't required? I think we might
actually rely on a stronger ordering.

@@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;

- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /* Fetch current value of maxMsgNum using atomic with memory barrier */
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();

if (stateP->resetState)
{
/*
* Force reset. We can say we have dealt with any messages added
* since the reset, as well; and that means we should clear the
* signaled flag, too.
*/
stateP->nextMsgNum = max;
stateP->resetState = false;
stateP->signaled = false;
LWLockRelease(SInvalReadLock);
return -1;
}

vs

@@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}

- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /* Update current value of maxMsgNum using atomic with memory barrier */
+ pg_write_barrier();
+ pg_atomic_write_u32(&segP->maxMsgNum, max);

/*
* Now that the maxMsgNum change is globally visible, we give everyone
* a swift kick to make sure they read the newly added messages.
* Releasing SInvalWriteLock will enforce a full memory barrier, so
* these (unlocked) changes will be committed to memory before we exit
* the function.
*/
for (i = 0; i < segP->numProcs; i++)
{
ProcState *stateP = &segP->procState[segP->pgprocnos[i]];

stateP->hasMessages = true;
}

On a loosely ordered architecture, the hasMessages=false in SIGetDataEntries()
could be reordered with the read of maxMsgNum. Which, afaict, would lead to
missing messages. That's not prevented by the pg_write_barrier() in
SIInsertDataEntries(). I think there may be other similar dangers.

This could be solved by adding full memory barriers in a few places.
Big thanks for the review and suggestion!

I agree, a pg_memory_barrier should be added before the read of
segP->maxMsgNum.
I think changing stateP->hasMessages to an atomic variable would be a better
way, but it would change the size of ProcState.

I don't see the need for a full barrier after the read of maxMsgNum, since
the other ProcState fields are protected by SInvalReadLock, aren't they? So I
left the read_barrier there.

I still avoid the use of read_membarrier since it is actually a write
operation. pg_memory_barrier is implemented as a write operation on x86_64 as
well, but on a memory cell on the process's own stack, so it will not be
contended.

And atomic_write_membarrier is used to write maxMsgNum just to simplify the
code. If backporting is considered, then write barriers before and after
could be used instead.

The fixed version is attached.
But:
I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.

Have you verified that this actually addresses the performance issue?
Problems with this spinlock were observed by at least two independent
technical support teams. First, a friendly vendor company shared the idea to
remove it. We don't know their exact situation, but I suppose it was quite
similar to the situation our tech support investigated at our client some
months later:
(Quote from the tech support report:)
Almost 20% of CPU time is spent at backtraces like:
4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
49c847 SIGetDataEntries
49bf94 ReceiveSharedInvalidMessages
4a14ba LockRelationOid
1671f4 relation_open
1de1cd table_open
5e82aa RelationGetStatExtList
402a01 get_relation_statistics (inlined)
402a01 get_relation_info
407a9e build_simple_rel
3daa1d add_base_rels_to_query
3daa1d add_base_rels_to_query
3dd92b query_planner
The client has many NUMA nodes in a single machine, and the software actively
generates invalidation messages (probably due to active usage of temporary
tables).

Since the backtrace is quite obvious and ends at s_lock, the patch should help.
--
regards
Yura Sokolov aka funny-falcon
Attachments:
v3-0001-sinvaladt.c-use-atomic-operations-on-maxMsgNum.patch (text/x-patch)
From de96fd271a107ce4300f758822d7cc2f9ab1ff1c Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Mon, 3 Feb 2025 11:58:33 +0300
Subject: [PATCH v3] sinvaladt.c: use atomic operations on maxMsgNum
msgnumLock spinlock could be contended.
Comment states it was used as memory barrier.
Lets use atomic ops with memory barriers directly instead.
---
src/backend/storage/ipc/sinvaladt.c | 80 +++++++++++++----------------
1 file changed, 36 insertions(+), 44 deletions(-)
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 2da91738c32..400bfdb0072 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -86,19 +86,7 @@
* has no need to touch anyone's ProcState, except in the infrequent cases
* when SICleanupQueue is needed. The only point of overlap is that
* the writer wants to change maxMsgNum while readers need to read it.
- * We deal with that by having a spinlock that readers must take for just
- * long enough to read maxMsgNum, while writers take it for just long enough
- * to write maxMsgNum. (The exact rule is that you need the spinlock to
- * read maxMsgNum if you are not holding SInvalWriteLock, and you need the
- * spinlock to write maxMsgNum unless you are holding both locks.)
- *
- * Note: since maxMsgNum is an int and hence presumably atomically readable/
- * writable, the spinlock might seem unnecessary. The reason it is needed
- * is to provide a memory barrier: we need to be sure that messages written
- * to the array are actually there before maxMsgNum is increased, and that
- * readers will see that data after fetching maxMsgNum. Multiprocessors
- * that have weak memory-ordering guarantees can fail without the memory
- * barrier instructions that are included in the spinlock sequences.
+ * We deal with that by using atomic operations with proper memory barriers.
*/
@@ -139,7 +127,7 @@ typedef struct ProcState
/* procPid is zero in an inactive ProcState array entry. */
pid_t procPid; /* PID of backend, for signaling */
/* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
- int nextMsgNum; /* next message number to read */
+ uint32 nextMsgNum; /* next message number to read */
bool resetState; /* backend needs to reset its state */
bool signaled; /* backend has been sent catchup signal */
bool hasMessages; /* backend has unread messages */
@@ -167,11 +155,9 @@ typedef struct SISeg
/*
* General state information
*/
- int minMsgNum; /* oldest message still needed */
- int maxMsgNum; /* next message number to be assigned */
- int nextThreshold; /* # of messages to call SICleanupQueue */
-
- slock_t msgnumLock; /* spinlock protecting maxMsgNum */
+ uint32 minMsgNum; /* oldest message still needed */
+ pg_atomic_uint32 maxMsgNum; /* next message number to be assigned */
+ uint32 nextThreshold; /* # of messages to call SICleanupQueue */
/*
* Circular buffer holding shared-inval messages
@@ -243,9 +229,8 @@ SharedInvalShmemInit(void)
/* Clear message counters, save size of procState array, init spinlock */
shmInvalBuffer->minMsgNum = 0;
- shmInvalBuffer->maxMsgNum = 0;
+ pg_atomic_init_u32(&shmInvalBuffer->maxMsgNum, 0);
shmInvalBuffer->nextThreshold = CLEANUP_MIN;
- SpinLockInit(&shmInvalBuffer->msgnumLock);
/* The buffer[] array is initially all unused, so we need not fill it */
@@ -303,7 +288,7 @@ SharedInvalBackendInit(bool sendOnly)
/* mark myself active, with all extant messages already read */
stateP->procPid = MyProcPid;
- stateP->nextMsgNum = segP->maxMsgNum;
+ stateP->nextMsgNum = pg_atomic_read_u32(&segP->maxMsgNum);
stateP->resetState = false;
stateP->signaled = false;
stateP->hasMessages = false;
@@ -382,8 +367,8 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
while (n > 0)
{
int nthistime = Min(n, WRITE_QUANTUM);
- int numMsgs;
- int max;
+ uint32 numMsgs;
+ uint32 max;
int i;
n -= nthistime;
@@ -399,7 +384,7 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
*/
for (;;)
{
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs + nthistime > MAXNUMMESSAGES ||
numMsgs >= segP->nextThreshold)
SICleanupQueue(true, nthistime);
@@ -410,17 +395,20 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
/*
* Insert new message(s) into proper slot of circular buffer
*/
- max = segP->maxMsgNum;
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
while (nthistime-- > 0)
{
segP->buffer[max % MAXNUMMESSAGES] = *data++;
max++;
}
- /* Update current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- segP->maxMsgNum = max;
- SpinLockRelease(&segP->msgnumLock);
+ /*
+ * We need to write maxMsgNum strictly after write of messages and
+ * strictly before write to hasMessages. Two write barriers could be
+ * used before and after. But to simplify code, write_membarrier is
+ * used.
+ */
+ pg_atomic_write_membarrier_u32(&segP->maxMsgNum, max);
/*
* Now that the maxMsgNum change is globally visible, we give everyone
@@ -473,7 +461,7 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
SISeg *segP;
ProcState *stateP;
- int max;
+ uint32 max;
int n;
segP = shmInvalBuffer;
@@ -506,10 +494,14 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
*/
stateP->hasMessages = false;
- /* Fetch current value of maxMsgNum using spinlock */
- SpinLockAcquire(&segP->msgnumLock);
- max = segP->maxMsgNum;
- SpinLockRelease(&segP->msgnumLock);
+ /*
+ * Full barrier before read of maxMsgNum is to synchronize against
+ * hasMessages=false. To synchronize message reads read barrier after is
+ * enough.
+ */
+ pg_memory_barrier();
+ max = pg_atomic_read_u32(&segP->maxMsgNum);
+ pg_read_barrier();
if (stateP->resetState)
{
@@ -576,11 +568,11 @@ void
SICleanupQueue(bool callerHasWriteLock, int minFree)
{
SISeg *segP = shmInvalBuffer;
- int min,
+ uint32 min,
minsig,
lowbound,
- numMsgs,
- i;
+ numMsgs;
+ int i;
ProcState *needSig = NULL;
/* Lock out all writers and readers */
@@ -595,14 +587,14 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* backends here it is possible for them to keep sending messages without
* a problem even when they are the only active backend.
*/
- min = segP->maxMsgNum;
- minsig = min - SIG_THRESHOLD;
- lowbound = min - MAXNUMMESSAGES + minFree;
+ min = pg_atomic_read_u32(&segP->maxMsgNum);
+ minsig = min - Min(min, SIG_THRESHOLD);
+ lowbound = min - Min(min, MAXNUMMESSAGES - minFree);
for (i = 0; i < segP->numProcs; i++)
{
ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
- int n = stateP->nextMsgNum;
+ uint32 n = stateP->nextMsgNum;
/* Ignore if already in reset state */
Assert(stateP->procPid != 0);
@@ -641,7 +633,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
if (min >= MSGNUMWRAPAROUND)
{
segP->minMsgNum -= MSGNUMWRAPAROUND;
- segP->maxMsgNum -= MSGNUMWRAPAROUND;
+ pg_atomic_sub_fetch_u32(&segP->maxMsgNum, MSGNUMWRAPAROUND);
for (i = 0; i < segP->numProcs; i++)
segP->procState[segP->pgprocnos[i]].nextMsgNum -= MSGNUMWRAPAROUND;
}
@@ -650,7 +642,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree)
* Determine how many messages are still in the queue, and set the
* threshold at which we should repeat SICleanupQueue().
*/
- numMsgs = segP->maxMsgNum - segP->minMsgNum;
+ numMsgs = pg_atomic_read_u32(&segP->maxMsgNum) - segP->minMsgNum;
if (numMsgs < CLEANUP_MIN)
segP->nextThreshold = CLEANUP_MIN;
else
--
2.43.0
Hi,
On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
21.03.2025 19:33, Andres Freund wrote:
I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.

Have you verified that this actually addresses the performance issue?

Problems with this spinlock were observed by at least two independent
technical support teams. First, a friendly vendor company shared the idea to
remove it. We don't know their exact situation, but I suppose it was quite
similar to the situation our tech support investigated at our client some
months later:

(Quote from the tech support report:)
Almost 20% of CPU time is spent at backtraces like:
4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
49c847 SIGetDataEntries
49bf94 ReceiveSharedInvalidMessages
4a14ba LockRelationOid
1671f4 relation_open
1de1cd table_open
5e82aa RelationGetStatExtList
402a01 get_relation_statistics (inlined)
402a01 get_relation_info
407a9e build_simple_rel
3daa1d add_base_rels_to_query
3daa1d add_base_rels_to_query
3dd92b query_planner

The client has many NUMA nodes in a single machine, and the software actively
generates invalidation messages (probably due to active usage of temporary
tables).

Since the backtrace is quite obvious and ends at s_lock, the patch should help.
I don't believe we have the whole story here. It just doesn't seem plausible
that, with the current code, the spinlock in SIGetDataEntries() would be the
bottleneck, rather than contention on the lwlock. The spinlock just covers a
*single read*. Unless pgpro has modified the relevant code?
One possible explanation for why the spinlock shows up so badly is that it is
due to false sharing. Note that SISeg->msgnumLock and the start of
SISeg->buffer[] are on the same cache line.
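
A quick standalone way to check that layout (a sketch with simplified
stand-ins for slock_t and SharedInvalidationMessage; exact offsets are
platform- and build-dependent):

#include <stddef.h>
#include <stdio.h>

typedef int slock_t;            /* stand-in; the real type varies by platform */
typedef struct { int id[4]; } SharedInvalidationMessage;  /* stand-in, 16 bytes */
#define MAXNUMMESSAGES 4096

typedef struct SISeg
{
    int         minMsgNum;
    int         maxMsgNum;
    int         nextThreshold;
    slock_t     msgnumLock;
    SharedInvalidationMessage buffer[MAXNUMMESSAGES];
    /* procState bookkeeping omitted */
} SISeg;

int main(void)
{
    size_t lock = offsetof(SISeg, msgnumLock);
    size_t buf = offsetof(SISeg, buffer);

    /* If they share a 64-byte line, writers filling the first slots also
     * bounce the cache line the spinlock lives on. */
    printf("msgnumLock at %zu, buffer[] at %zu, same line: %s\n",
           lock, buf, (lock / 64 == buf / 64) ? "yes" : "no");
    return 0;
}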
How was this "Almost 20% of CPU time is spent at backtraces like" determined?
Greetings,
Andres Freund
Good day, Andres
24.03.2025 16:08, Andres Freund wrote:
On 2025-03-24 13:41:17 +0300, Yura Sokolov wrote:
21.03.2025 19:33, Andres Freund wrote:
I'd also like to know a bit more about the motivation here - I can easily
believe that you hit contention around the shared inval queue, but I find it
somewhat hard to believe that a spinlock that's acquired *once* per batch
("quantum"), covering a single read/write, is going to be the bottleneck,
rather than the much longer held LWLock, that protects iterating over all
procs.

Have you verified that this actually addresses the performance issue?

Problems with this spinlock were observed by at least two independent
technical support teams. First, a friendly vendor company shared the idea to
remove it. We don't know their exact situation, but I suppose it was quite
similar to the situation our tech support investigated at our client some
months later:

(Quote from the tech support report:)
Almost 20% of CPU time is spent at backtraces like:
4b0d2d s_lock (/opt/pgpro/ent-15/bin/postgres)
49c847 SIGetDataEntries
49bf94 ReceiveSharedInvalidMessages
4a14ba LockRelationOid
1671f4 relation_open
1de1cd table_open
5e82aa RelationGetStatExtList
402a01 get_relation_statistics (inlined)
402a01 get_relation_info
407a9e build_simple_rel
3daa1d add_base_rels_to_query
3daa1d add_base_rels_to_query
3dd92b query_planner

The client has many NUMA nodes in a single machine, and the software actively
generates invalidation messages (probably due to active usage of temporary
tables).

Since the backtrace is quite obvious and ends at s_lock, the patch should
help.
I don't believe we have the whole story here. It just doesn't seem plausible
that, with the current code, the spinlock in SIGetDataEntries() would be the
bottleneck, rather than contention on the lwlock. The spinlock just covers a
*single read*. Unless pgpro has modified the relevant code?

One possible explanation for why the spinlock shows up so badly is that it is
due to false sharing. Note that SISeg->msgnumLock and the start of
SISeg->buffer[] are on the same cache line.

How was this "Almost 20% of CPU time is spent at backtraces like" determined?
Excuse me, I didn't attach the flamegraph collected by our tech support from
the client's server during the peak of the problem. So I attach it now.

If you open it in a browser and search for "SIGetDataEntries", you'll see it
consumes 18.4%. It is not a single large bar; instead there are dozens of
calls to SIGetDataEntries, and every one spends almost all its time in
s_lock. If you search for s_lock, it consumes 16.9%, and almost every call
to s_lock is from SIGetDataEntries.

It looks like we call ReceiveSharedInvalidMessages
(AcceptInvalidationMessages, actually) too frequently during planning. And if
there is a large stream of invalidation messages, SIGetDataEntries has some
work to do very frequently. Therefore many backends that are planning their
queries at the moment start to fight for msgnumLock.

If ReceiveSharedInvalidMessages (and SIGetDataEntries through it) were called
rarely, then your conclusion would be right: taking a spinlock around just
the read of one variable before processing a large batch of messages does not
look like a source of problems. But since it is called very frequently, and
the stream of messages is high, "there are always a few new messages".

As I've said, it is most probably due to the use of the famous 1C software,
which uses a lot of temporary tables, so it generates a high volume of
invalidation messages. We've patched pgpro postgres to NOT SEND most of the
invalidation messages generated by temporary tables, but it is difficult to
avoid sending all of them.
--
regards
Yura Sokolov aka funny-falcon
Attachments:
perf20250123_1120.lst.bind.svg.gz (application/gzip)