Slow standby snapshot
Hi,
I recently ran into a problem in one of our production PostgreSQL clusters.
I noticed lock contention on the ProcArray lock on a standby, which causes
WAL replay lag to grow.
To reproduce this, you can do the following:
1) set max_connections to a big number, like 100000
2) begin a transaction on the primary
3) start a pgbench workload on the primary and on the standby
After a while it will be possible to see KnownAssignedXidsGetAndSetXmin in
perf top consuming about 75% of CPU.
%%
PerfTop: 1060 irqs/sec kernel: 0.0% exact: 0.0% [4000Hz cycles:u],
(target_pid: 273361)
-------------------------------------------------------------------------------
73.92% postgres [.] KnownAssignedXidsGetAndSetXmin
1.40% postgres [.] base_yyparse
0.96% postgres [.] LWLockAttemptLock
0.84% postgres [.] hash_search_with_hash_value
0.84% postgres [.] AtEOXact_GUC
0.72% postgres [.] ResetAllOptions
0.70% postgres [.] AllocSetAlloc
0.60% postgres [.] _bt_compare
0.55% postgres [.] core_yylex
0.42% libc-2.27.so [.] __strlen_avx2
0.23% postgres [.] LWLockRelease
0.19% postgres [.] MemoryContextAllocZeroAligned
0.18% postgres [.] expression_tree_walker.part.3
0.18% libc-2.27.so [.] __memmove_avx_unaligned_erms
0.17% postgres [.] PostgresMain
0.17% postgres [.] palloc
0.17% libc-2.27.so [.] _int_malloc
0.17% postgres [.] set_config_option
0.17% postgres [.] ScanKeywordLookup
0.16% postgres [.] _bt_checkpage
%%
We tried to fix this by using a BitMapSet instead of the boolean array
KnownAssignedXidsValid, but that did not help much.
Using a doubly linked list helps a little more: we got +1000 tps on the
pgbench workload with patched PostgreSQL. The general idea of the patch
is that, instead of memorizing which elements of KnownAssignedXids are
valid, we maintain a doubly linked list of the valid ones. The structure
works exactly as before, except that taking a snapshot on the replica is
now O(running transactions) instead of O(head - tail), which is significantly
faster under some workloads. The patch reduces the CPU usage of
KnownAssignedXidsGetAndSetXmin to ~48% instead of ~74%, but it does not
eliminate it from perf top.
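
To make the idea concrete, here is a minimal sketch of the data structure
(simplified types, no locking; the real patch is attached in the follow-up
message):

#include <stdint.h>

typedef uint32_t TransactionId;     /* uint32, as in PostgreSQL */

typedef struct
{
    int     prv;                    /* index of previous valid entry, -1 if none      */
    int     nxt;                    /* index of next valid entry, nentries if none    */
} ValidLink;

static TransactionId   *xids;       /* stands in for KnownAssignedXids                */
static ValidLink       *links;      /* stands in for KnownAssignedXidsValidDLL        */

/* Removing an entry unlinks it in O(1) instead of leaving a gap behind. */
static void
remove_entry(int i, int nentries)
{
    int     prv = links[i].prv;
    int     nxt = links[i].nxt;

    if (prv >= 0)
        links[prv].nxt = nxt;
    if (nxt < nentries)
        links[nxt].prv = prv;
}

/* The snapshot scan visits only valid entries: O(running transactions)
 * instead of O(head - tail).  Assumes the entry at tail is valid, as the
 * real array keeps it. */
static int
collect_xids(int tail, int head, TransactionId *out)
{
    int     count = 0;

    for (int i = tail; i < head; i = links[i].nxt)
        out[count++] = xids[i];
    return count;
}
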
The problem is easier to reproduce on PG13, since PG14 has some snapshot
optimizations.
Thanks!
Best regards, reshke
Sorry, I forgot to attach the patch to the previous message.
Attachments:
UseDoublyLinkedListInKnowAssingedXods.patch (text/x-patch)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5ff8cab394..165cf56ea9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -255,9 +255,18 @@ static PGPROC *allProcs;
* Bookkeeping for tracking emulated transactions in recovery
*/
static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
+
+typedef struct {
+ int prv;
+ int nxt;
+} KnownAssignedXidValidNode;
+
+// Doubly linked list of valid xids
+static KnownAssignedXidValidNode *KnownAssignedXidsValidDLL;
static TransactionId latestObservedXid = InvalidTransactionId;
+
+
/*
* If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
* the highest xid that might still be running that we don't have in
@@ -348,6 +357,9 @@ static void GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons);
/*
* Report shared-memory space needed by CreateSharedProcArray.
*/
+
+static KnownAssignedXidValidNode KAX_E_INVALID;
+
Size
ProcArrayShmemSize(void)
{
@@ -380,13 +392,20 @@ ProcArrayShmemSize(void)
size = add_size(size,
mul_size(sizeof(TransactionId),
TOTAL_MAX_CACHED_SUBXIDS));
- size = add_size(size,
- mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
+ size = add_size(size,
+ mul_size(sizeof(KnownAssignedXidValidNode), TOTAL_MAX_CACHED_SUBXIDS));
}
+ KAX_E_INVALID.prv = -1;
+ KAX_E_INVALID.nxt = TOTAL_MAX_CACHED_SUBXIDS;
+
return size;
}
+#define KAX_DLL_INDEX_VALID(i) (-1 < i && i < TOTAL_MAX_CACHED_SUBXIDS)
+#define KAX_DLL_ENTRY_INVALID(i) (-1 == KnownAssignedXidsValidDLL[i].prv && KnownAssignedXidsValidDLL[i].nxt == TOTAL_MAX_CACHED_SUBXIDS)
+
+
/*
* Initialize the shared PGPROC array during postmaster startup.
*/
@@ -431,10 +450,15 @@ CreateSharedProcArray(void)
mul_size(sizeof(TransactionId),
TOTAL_MAX_CACHED_SUBXIDS),
&found);
- KnownAssignedXidsValid = (bool *)
- ShmemInitStruct("KnownAssignedXidsValid",
- mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
+
+ KnownAssignedXidsValidDLL = (KnownAssignedXidValidNode*)
+ ShmemInitStruct("KnownAssignedXidsValidDLL",
+ mul_size(sizeof(KnownAssignedXidValidNode), TOTAL_MAX_CACHED_SUBXIDS),
&found);
+
+ for (int i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; ++ i) {
+ KnownAssignedXidsValidDLL[i] = KAX_E_INVALID;
+ }
}
}
@@ -4545,15 +4569,17 @@ KnownAssignedXidsCompress(bool force)
* re-aligning data to 0th element.
*/
compress_index = 0;
- for (i = tail; i < head; i++)
+ int prv = -1;
+ for (i = tail; i < head; i = KnownAssignedXidsValidDLL[i].nxt)
{
- if (KnownAssignedXidsValid[i])
- {
- KnownAssignedXids[compress_index] = KnownAssignedXids[i];
- KnownAssignedXidsValid[compress_index] = true;
- compress_index++;
- }
+ KnownAssignedXids[compress_index] = KnownAssignedXids[i];
+ KnownAssignedXidsValidDLL[compress_index].prv = prv;
+ KnownAssignedXidsValidDLL[compress_index].nxt = compress_index + 1;
+
+ compress_index++;
}
+ // fix last entry
+ KnownAssignedXidsValidDLL[compress_index - 1].nxt = TOTAL_MAX_CACHED_SUBXIDS;
pArray->tailKnownAssignedXids = 0;
pArray->headKnownAssignedXids = compress_index;
@@ -4652,7 +4678,8 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
for (i = 0; i < nxids; i++)
{
KnownAssignedXids[head] = next_xid;
- KnownAssignedXidsValid[head] = true;
+ KnownAssignedXidsValidDLL[head].nxt = 1 + head;
+ KnownAssignedXidsValidDLL[head + 1].prv = head;
TransactionIdAdvance(next_xid);
head++;
}
@@ -4741,12 +4768,23 @@ KnownAssignedXidsSearch(TransactionId xid, bool remove)
if (result_index < 0)
return false; /* not in array */
- if (!KnownAssignedXidsValid[result_index])
+ if (KAX_DLL_ENTRY_INVALID(result_index))
return false; /* in array, but invalid */
if (remove)
{
- KnownAssignedXidsValid[result_index] = false;
+ int prv = KnownAssignedXidsValidDLL[result_index].prv;
+ int nxt = KnownAssignedXidsValidDLL[result_index].nxt;
+
+ KnownAssignedXidsValidDLL[result_index] = KAX_E_INVALID;
+
+ if (KAX_DLL_INDEX_VALID(prv)) {
+ KnownAssignedXidsValidDLL[prv].nxt = nxt;
+ }
+
+ if (KAX_DLL_INDEX_VALID(nxt)) {
+ KnownAssignedXidsValidDLL[nxt].prv = prv;
+ }
pArray->numKnownAssignedXids--;
Assert(pArray->numKnownAssignedXids >= 0);
@@ -4757,9 +4795,7 @@ KnownAssignedXidsSearch(TransactionId xid, bool remove)
*/
if (result_index == tail)
{
- tail++;
- while (tail < head && !KnownAssignedXidsValid[tail])
- tail++;
+ tail = KnownAssignedXidsValidDLL[tail].nxt;
if (tail >= head)
{
/* Array is empty, so we can reset both pointers */
@@ -4868,21 +4904,23 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
tail = pArray->tailKnownAssignedXids;
head = pArray->headKnownAssignedXids;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i = KnownAssignedXidsValidDLL[i].nxt)
{
- if (KnownAssignedXidsValid[i])
- {
- TransactionId knownXid = KnownAssignedXids[i];
+ TransactionId knownXid = KnownAssignedXids[i];
- if (TransactionIdFollowsOrEquals(knownXid, removeXid))
- break;
+ if (TransactionIdFollowsOrEquals(knownXid, removeXid))
+ break;
- if (!StandbyTransactionIdIsPrepared(knownXid))
- {
- KnownAssignedXidsValid[i] = false;
- count++;
- }
- }
+ if (!StandbyTransactionIdIsPrepared(knownXid))
+ {
+ int nxt = KnownAssignedXidsValidDLL[i].nxt;
+ KnownAssignedXidsValidDLL[i] = KAX_E_INVALID;
+
+ if (KAX_DLL_INDEX_VALID(nxt)) {
+ KnownAssignedXidsValidDLL[i].prv = -1;
+ }
+ count++;
+ }
}
pArray->numKnownAssignedXids -= count;
@@ -4891,21 +4929,20 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
/*
* Advance the tail pointer if we've marked the tail item invalid.
*/
- for (i = tail; i < head; i++)
- {
- if (KnownAssignedXidsValid[i])
- break;
- }
- if (i >= head)
- {
- /* Array is empty, so we can reset both pointers */
- pArray->headKnownAssignedXids = 0;
- pArray->tailKnownAssignedXids = 0;
- }
- else
- {
- pArray->tailKnownAssignedXids = i;
- }
+ if (KAX_DLL_ENTRY_INVALID(tail)) {
+ tail = KnownAssignedXidsValidDLL[tail].nxt;
+
+ if (!KAX_DLL_INDEX_VALID(tail))
+ {
+ /* Array is empty, so we can reset both pointers */
+ pArray->headKnownAssignedXids = 0;
+ pArray->tailKnownAssignedXids = 0;
+ }
+ else
+ {
+ pArray->tailKnownAssignedXids = tail;
+ }
+ }
/* Opportunistically compress the array */
KnownAssignedXidsCompress(false);
@@ -4957,32 +4994,29 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
head = procArray->headKnownAssignedXids;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i = KnownAssignedXidsValidDLL[i].nxt)
{
/* Skip any gaps in the array */
- if (KnownAssignedXidsValid[i])
- {
- TransactionId knownXid = KnownAssignedXids[i];
-
- /*
- * Update xmin if required. Only the first XID need be checked,
- * since the array is sorted.
- */
- if (count == 0 &&
- TransactionIdPrecedes(knownXid, *xmin))
- *xmin = knownXid;
-
- /*
- * Filter out anything >= xmax, again relying on sorted property
- * of array.
- */
- if (TransactionIdIsValid(xmax) &&
- TransactionIdFollowsOrEquals(knownXid, xmax))
- break;
-
- /* Add knownXid into output array */
- xarray[count++] = knownXid;
- }
+ TransactionId knownXid = KnownAssignedXids[i];
+
+ /*
+ * Update xmin if required. Only the first XID need be checked,
+ * since the array is sorted.
+ */
+ if (count == 0 &&
+ TransactionIdPrecedes(knownXid, *xmin))
+ *xmin = knownXid;
+
+ /*
+ * Filter out anything >= xmax, again relying on sorted property
+ * of array.
+ */
+ if (TransactionIdIsValid(xmax) &&
+ TransactionIdFollowsOrEquals(knownXid, xmax))
+ break;
+
+ /* Add knownXid into output array */
+ xarray[count++] = knownXid;
}
return count;
@@ -5007,12 +5041,12 @@ KnownAssignedXidsGetOldestXmin(void)
head = procArray->headKnownAssignedXids;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
- {
- /* Skip any gaps in the array */
- if (KnownAssignedXidsValid[i])
- return KnownAssignedXids[i];
- }
+ i = KnownAssignedXidsValidDLL[i].nxt;
+
+ /* Skip any gaps in the array */
+ if (i < head) {
+ return KnownAssignedXids[i];
+ }
return InvalidTransactionId;
}
@@ -5042,13 +5076,10 @@ KnownAssignedXidsDisplay(int trace_level)
initStringInfo(&buf);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i = KnownAssignedXidsValidDLL[i].nxt)
{
- if (KnownAssignedXidsValid[i])
- {
- nxids++;
- appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
- }
+ nxids++;
+ appendStringInfo(&buf, "[%d]=%u ", i, KnownAssignedXids[i]);
}
elog(trace_level, "%d KnownAssignedXids (num=%d tail=%d head=%d) %s",
)

Hello.
I recently ran into a problem in one of our production postgresql cluster.
I had noticed lock contention on procarray lock on standby, which causes WAL
replay lag growth.
Yes, I saw the same issue on my production cluster.
1) set max_connections to big number, like 100000
I made the tests with a more realistic value - 5000. It is a valid value
for Amazon RDS, for example (the default is
LEAST({DBInstanceClassMemory/9531392}, 5000)).
The test looks like this:
pgbench -i -s 10 -U postgres -d postgres
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
a long transaction on the primary: begin; select txid_current();
perf top -p <pid of some standby>
So, on Postgres 14 (master) the non-patched version looks like this:
5.13% postgres [.] KnownAssignedXidsGetAndSetXmin
4.61% postgres [.] pg_checksum_block
2.54% postgres [.] AllocSetAlloc
2.44% postgres [.] base_yyparse
It is too much to spend 5-6% of CPU running through an array :) I think
it should be fixed for both the 13 and 14 versions.
The patched version looks like this (I was unable to spot
KnownAssignedXidsGetAndSetXmin):
3.08% postgres [.] pg_checksum_block
2.89% postgres [.] AllocSetAlloc
2.66% postgres [.] base_yyparse
2.00% postgres [.] MemoryContextAllocZeroAligned
On Postgres 13 the non-patched version looks even worse (it definitely needs
to be fixed, in my opinion):
26.44% postgres [.] KnownAssignedXidsGetAndSetXmin
2.17% postgres [.] base_yyparse
2.01% postgres [.] AllocSetAlloc
1.55% postgres [.] MemoryContextAllocZeroAligned
But your patch does not apply to REL_13_STABLE. Could you please
provide two versions?
Also, there are warnings while building with patch:
procarray.c:4595:9: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
4595 | int prv = -1;
| ^~~
procarray.c: In function ‘KnownAssignedXidsGetOldestXmin’:
procarray.c:5056:5: warning: variable ‘tail’ set but not used
[-Wunused-but-set-variable]
5056 | tail;
| ^~~~
procarray.c:5067:38: warning: ‘i’ is used uninitialized in
this function [-Wuninitialized]
5067 | i = KnownAssignedXidsValidDLL[i].nxt;
Some of them are clear errors, so please recheck the code.
Also, maybe it is better to reduce the invasiveness by using a
simpler approach. For example, use the first bit to mark an xid as valid
and the remaining 7 bits (128 values) as an optimistic offset to the next
valid xid (jumping by at most 127 steps in the worst case), as sketched below.
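
Roughly something like the following sketch (hypothetical names, only to make
the suggestion concrete):

#include <stdbool.h>
#include <stdint.h>

/* One byte per slot: the high bit says "this xid is valid", the low 7 bits
 * are an optimistic hint - how far ahead the next valid slot is (1..127). */
#define SLOT_VALID          0x80
#define SLOT_OFFSET_MASK    0x7F

static inline bool
slot_is_valid(uint8_t slot)
{
    return (slot & SLOT_VALID) != 0;
}

static inline int
slot_next_hint(uint8_t slot)
{
    int     off = slot & SLOT_OFFSET_MASK;

    return off > 0 ? off : 1;           /* always advance by at least one */
}

/* The scan skips gaps in jumps of up to 127 slots per hint. */
static int
scan(const uint8_t *slots, int tail, int head)
{
    int     count = 0;

    for (int i = tail; i < head; i += slot_next_hint(slots[i]))
    {
        if (slot_is_valid(slots[i]))
            count++;                    /* process the xid at index i here */
    }
    return count;
}
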
What do you think?
Also, it is a good idea to register the patch in the commitfest app
(https://commitfest.postgresql.org/).
Thanks,
Michail.
Hello, Kirill.
Also, maybe it is better to reduce the invasiveness by using a
simpler approach. For example, use the first bit to mark an xid as valid
and the remaining 7 bits (128 values) as an optimistic offset to the next
valid xid (jumping by at most 127 steps in the worst case).
What do you think?
I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.
I have looked again at your patch, and it seems like it has a lot of
issues at the moment:
* an error in KnownAssignedXidsGetOldestXmin: `i` is uninitialized and the logic is wrong
* an error in the compression function
(```KnownAssignedXidsValidDLL[compress_index].prv = prv;```, `prv` is
never updated)
* probably other errors?
* compilation warnings
* the logic around `KAX_DLL_ENTRY_INVALID` looks a little complex
* variable/method placement is bad (see `KAX_E_INVALID` and others)
* the documentation about KnownAssignedXidsValid needs updating, see ```To
keep individual deletions cheap, we need to allow gaps in the array```
in procarray.c
* formatting is broken
Do you have plans to update it? If not, I could try to rewrite it.
Also, what about adding the patch to the commitfest?
Thanks,
Michail.
Hello.
I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.
That was a mistake on my part. I have repeated the approach and got good
results with a small and non-invasive patch.
The main idea is a simple optimistic optimization - store the offset to the
next valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions, but it is almost not detectable (because of
CPU caches).
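
Simplified, the scan works like the sketch below (the pg_atomic_* calls and
the xmin/xmax handling of the attached patch are left out; hint entries start
at 1 and only ever grow):

#include <stdbool.h>
#include <stdint.h>

/* next[i] is only a hint: "it is safe to skip at least this many slots".
 * It starts at 1 for every slot and is enlarged lazily during the scan,
 * so KnownAssignedXidsAdd/Remove stay as cheap as before. */
static int
scan_with_hints(const bool *valid, uint32_t *next, int tail, int head)
{
    int     count = 0;
    int     prev = tail;

    for (int i = tail; i < head; i += next[i])
    {
        if (!valid[i])
            continue;                   /* a gap; the hint is grown below */

        if (i != prev)
        {
            next[prev] = (uint32_t) (i - prev); /* remember the skip just made */
            prev = i;
        }
        count++;                        /* process KnownAssignedXids[i] here */
    }
    return count;
}
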
* TEST
The next testing setup was used:
max_connections=5000 (taken from real RDS installation)
pgbench -i -s 10 -U postgres -d postgres
# emulate typical OLTP load
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
# emulate typical cache load on the replica
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
# emulate some typical long transactions up to 60 seconds on primary
echo "\set t random(0, 60)
BEGIN;
select txid_current();
select pg_sleep(:t);
COMMIT;" > slow_transactions.bench
pgbench -f /home/nkey/pg/slow_transactions.bench -p 5432 -j 1 -c 10 -n
-P 1 -T 18000 -U postgres postgres
* RESULTS
*REL_13_STABLE* - 23.02% vs 0.76%
non-patched:
23.02% postgres [.] KnownAssignedXidsGetAndSetXmin
2.56% postgres [.] base_yyparse
2.15% postgres [.] AllocSetAlloc
1.68% postgres [.] MemoryContextAllocZeroAligned
1.51% postgres [.] hash_search_with_hash_value
1.26% postgres [.] SearchCatCacheInternal
1.03% postgres [.] hash_bytes
0.92% postgres [.] pg_checksum_block
0.89% postgres [.] expression_tree_walker
0.81% postgres [.] core_yylex
0.69% postgres [.] palloc
0.68% [kernel] [k] do_syscall_64
0.59% postgres [.] _bt_compare
0.54% postgres [.] new_list
patched:
3.09% postgres [.] base_yyparse
3.00% postgres [.] AllocSetAlloc
2.01% postgres [.] MemoryContextAllocZeroAligned
1.89% postgres [.] SearchCatCacheInternal
1.80% postgres [.] hash_search_with_hash_value
1.27% postgres [.] expression_tree_walker
1.27% postgres [.] pg_checksum_block
1.18% postgres [.] hash_bytes
1.10% postgres [.] core_yylex
0.96% [kernel] [k] do_syscall_64
0.86% postgres [.] palloc
0.84% postgres [.] _bt_compare
0.79% postgres [.] new_list
0.76% postgres [.] KnownAssignedXidsGetAndSetXmin
*MASTER* - 6.16% vs ~0%
(includes snapshot scalability optimization by Andres Freund (1))
non-patched:
6.16% postgres [.] KnownAssignedXidsGetAndSetXmin
3.05% postgres [.] AllocSetAlloc
2.59% postgres [.] base_yyparse
1.95% postgres [.] hash_search_with_hash_value
1.87% postgres [.] MemoryContextAllocZeroAligned
1.85% postgres [.] SearchCatCacheInternal
1.27% postgres [.] hash_bytes
1.16% postgres [.] expression_tree_walker
1.06% postgres [.] core_yylex
0.94% [kernel] [k] do_syscall_64
patched:
3.35% postgres [.] base_yyparse
2.84% postgres [.] AllocSetAlloc
1.89% postgres [.] hash_search_with_hash_value
1.82% postgres [.] MemoryContextAllocZeroAligned
1.79% postgres [.] SearchCatCacheInternal
1.49% postgres [.] pg_checksum_block
1.26% postgres [.] hash_bytes
1.26% postgres [.] expression_tree_walker
1.08% postgres [.] core_yylex
1.04% [kernel] [k] do_syscall_64
0.81% postgres [.] palloc
Looks like it is possible to get a significant TPS increase on a very
typical standby workload.
Currently, I have no environment to measure TPS accurately. Could you
please try it on yours?
I have attached two versions of the patch - for master and REL_13_STABLE.
Also, I am going to add a patch to commitfest (2).
Thanks,
Michail.
(1): https://commitfest.postgresql.org/29/2500/
(2): https://commitfest.postgresql.org/34/3271/
Attachments:
know_xid_optimization.patch (text/x-patch)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c7816fcfb3..27486c096b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
*/
static TransactionId *KnownAssignedXids;
static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
static TransactionId latestObservedXid = InvalidTransactionId;
/*
@@ -405,6 +406,7 @@ void
CreateSharedProcArray(void)
{
bool found;
+ int i;
/* Create or attach to the ProcArray shared structure */
procArray = (ProcArrayStruct *)
@@ -446,6 +448,12 @@ CreateSharedProcArray(void)
ShmemInitStruct("KnownAssignedXidsValid",
mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
&found);
+ KnownAssignedXidsNext = (pg_atomic_uint32 *)
+ ShmemInitStruct("KnownAssignedXidsNext",
+ mul_size(sizeof(pg_atomic_uint32), TOTAL_MAX_CACHED_SUBXIDS),
+ &found);
+ for (i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+ pg_atomic_init_u32(&KnownAssignedXidsNext[i], 1);
}
}
@@ -4598,12 +4606,13 @@ KnownAssignedXidsCompress(bool force)
* re-aligning data to 0th element.
*/
compress_index = 0;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
KnownAssignedXids[compress_index] = KnownAssignedXids[i];
KnownAssignedXidsValid[compress_index] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[compress_index], 1);
compress_index++;
}
}
@@ -4706,6 +4715,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
{
KnownAssignedXids[head] = next_xid;
KnownAssignedXidsValid[head] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[head], 1);
TransactionIdAdvance(next_xid);
head++;
}
@@ -4921,7 +4931,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
tail = pArray->tailKnownAssignedXids;
head = pArray->headKnownAssignedXids;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
@@ -4944,7 +4954,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
/*
* Advance the tail pointer if we've marked the tail item invalid.
*/
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
break;
@@ -4994,7 +5004,8 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
int count = 0;
int head,
tail;
- int i;
+ int i,
+ prev;
/*
* Fetch head just once, since it may change while we loop. We can stop
@@ -5008,9 +5019,10 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
SpinLockAcquire(&procArray->known_assigned_xids_lck);
tail = procArray->tailKnownAssignedXids;
head = procArray->headKnownAssignedXids;
+ prev = tail;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -5033,6 +5045,14 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
TransactionIdFollowsOrEquals(knownXid, xmax))
break;
+ if (prev != i)
+ {
+ uint32 offset = i - prev;
+ if (pg_atomic_read_u32(&KnownAssignedXidsNext[prev]) != offset)
+ pg_atomic_write_u32(&KnownAssignedXidsNext[prev], offset);
+ prev = i;
+ }
+
/* Add knownXid into output array */
xarray[count++] = knownXid;
}
@@ -5060,7 +5080,7 @@ KnownAssignedXidsGetOldestXmin(void)
head = procArray->headKnownAssignedXids;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -5095,7 +5115,7 @@ KnownAssignedXidsDisplay(int trace_level)
initStringInfo(&buf);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
pg13_know_xid_optimization.patch (text/x-patch)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 3447481b16..fd600944b6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -109,6 +109,7 @@ static PGXACT *allPgXact;
*/
static TransactionId *KnownAssignedXids;
static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
static TransactionId latestObservedXid = InvalidTransactionId;
/*
@@ -225,6 +226,7 @@ void
CreateSharedProcArray(void)
{
bool found;
+ int i;
/* Create or attach to the ProcArray shared structure */
procArray = (ProcArrayStruct *)
@@ -266,6 +268,12 @@ CreateSharedProcArray(void)
ShmemInitStruct("KnownAssignedXidsValid",
mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
&found);
+ KnownAssignedXidsNext = (pg_atomic_uint32 *)
+ ShmemInitStruct("KnownAssignedXidsNext",
+ mul_size(sizeof(pg_atomic_uint32), TOTAL_MAX_CACHED_SUBXIDS),
+ &found);
+ for (i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+ pg_atomic_init_u32(&KnownAssignedXidsNext[i], 1);
}
}
@@ -3558,12 +3566,13 @@ KnownAssignedXidsCompress(bool force)
* re-aligning data to 0th element.
*/
compress_index = 0;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
KnownAssignedXids[compress_index] = KnownAssignedXids[i];
KnownAssignedXidsValid[compress_index] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[compress_index], 1);
compress_index++;
}
}
@@ -3666,6 +3675,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
{
KnownAssignedXids[head] = next_xid;
KnownAssignedXidsValid[head] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[head], 1);
TransactionIdAdvance(next_xid);
head++;
}
@@ -3881,7 +3891,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
tail = pArray->tailKnownAssignedXids;
head = pArray->headKnownAssignedXids;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
@@ -3904,7 +3914,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
/*
* Advance the tail pointer if we've marked the tail item invalid.
*/
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
break;
@@ -3954,7 +3964,8 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
int count = 0;
int head,
tail;
- int i;
+ int i,
+ prev;
/*
* Fetch head just once, since it may change while we loop. We can stop
@@ -3968,9 +3979,10 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
SpinLockAcquire(&procArray->known_assigned_xids_lck);
tail = procArray->tailKnownAssignedXids;
head = procArray->headKnownAssignedXids;
+ prev = tail;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -3993,6 +4005,14 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
TransactionIdFollowsOrEquals(knownXid, xmax))
break;
+ if (prev != i)
+ {
+ uint32 offset = i - prev;
+ if (pg_atomic_read_u32(&KnownAssignedXidsNext[prev]) != offset)
+ pg_atomic_write_u32(&KnownAssignedXidsNext[prev], offset);
+ prev = i;
+ }
+
/* Add knownXid into output array */
xarray[count++] = knownXid;
}
@@ -4020,7 +4040,7 @@ KnownAssignedXidsGetOldestXmin(void)
head = procArray->headKnownAssignedXids;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -4055,7 +4075,7 @@ KnownAssignedXidsDisplay(int trace_level)
initStringInfo(&buf);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
Hi,
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
Greetings,
Andres Freund
Hi,
On 3 Aug 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements simple membership test idea. What kind of redesign would you suggest? Proposed optimisation makes it close to optimal, but needs eventual compression.
Maybe use a hashtable of running transactions? It will be slightly faster when adding/removing single transactions. But much worse when doing KnownAssignedXidsRemove().
Maybe use a tree? (AVL/RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
Best regards, Andrey Borodin.
Hi,
On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
On 3 Aug 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements simple membership test idea. What kind of
redesign would you suggest? Proposed optimisation makes it close to optimal,
but needs eventual compression.
Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding/removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
Maybe use a tree? (AVL/RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
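
As a rough sketch only (hypothetical names, no locking shown), such a dense
copy could be invalidated cheaply by the single writer and rebuilt lazily
when a snapshot actually needs it:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_KNOWN_XIDS  65536           /* stands in for TOTAL_MAX_CACHED_SUBXIDS */

/* A gap-free copy of the known-assigned xids, owned by the startup process
 * (the only writer during recovery).  Snapshot builders copy from it while
 * it is marked fresh and fall back to the sparse array otherwise. */
typedef struct
{
    bool            fresh;              /* cleared by the writer on any change */
    int             nxids;
    TransactionId   xids[MAX_KNOWN_XIDS];
} DenseKnownXids;

static void
writer_invalidate(DenseKnownXids *dense)
{
    dense->fresh = false;               /* cheap enough for every add/remove */
}

static void
writer_rebuild(DenseKnownXids *dense, const TransactionId *sparse,
               const bool *valid, int tail, int head)
{
    int     n = 0;

    for (int i = tail; i < head; i++)
        if (valid[i])
            dense->xids[n++] = sparse[i];
    dense->nxids = n;
    dense->fresh = true;                /* e.g. done when a snapshot is needed */
}
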
Greetings,
Andres Freund
Hello, Andres.
Thanks for your feedback.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding/removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
It actually was a hashtable in 2010. It was replaced by Simon Riggs
in 2871b4618af1acc85665eec0912c48f8341504c4.
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent.
It is still about 5-7% of CPU for a typical workload, a considerable
amount for me. And a lot of systems still work on 12 and 13.
The proposed approach eliminates KnownAssignedXidsGetAndSetXmin from
the top by a small changeset.
But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
Could you please explain it in more detail?
Single writer and GetSnapshotData() already exclusively hold
ProcArrayLock at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.
BTW, the tiny thing we could also fix now is (comment from code):
(We could dispense with the spinlock if we were to
* create suitable memory access barrier primitives and use those instead.)
* The spinlock must be taken to read or write the head/tail pointers unless
* the caller holds ProcArrayLock exclusively.
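
Conceptually, the barrier-based variant that comment hints at would look
something like this sketch (plain C11 atomics here, just for illustration -
not a concrete proposal for procarray.c):

#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t TransactionId;

static TransactionId    xids[65536];
static _Atomic int      head;           /* published with release semantics */

/* The single writer fills the slot first, then publishes the new head. */
static void
writer_add(TransactionId xid)
{
    int     h = atomic_load_explicit(&head, memory_order_relaxed);

    xids[h] = xid;
    atomic_store_explicit(&head, h + 1, memory_order_release);
}

/* A reader that observes the new head is guaranteed to see the slots
 * written before it - no spinlock needed for the head/tail pointers. */
static int
reader_scan(TransactionId *out)
{
    int     h = atomic_load_explicit(&head, memory_order_acquire);

    for (int i = 0; i < h; i++)
        out[i] = xids[i];
    return h;
}
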
Thanks,
Michail.
Hi,
On 2021-08-03 22:23:58 +0300, Michail Nikolaev wrote:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent.
It is still about 5-7% of CPU for a typical workload, a considerable
amount for me.
I'm not saying we shouldn't optimize things. Just that it's less pressing. And
what kind of price we're willing to optimize may have changed.
And a lot of systems still work on 12 and 13.
I don't see us backporting performance improvements around this to 12 and 13,
so I don't think that matters much... We've done that a few times, but usually
when there's some accidentally quadratic behaviour or such.
But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
Could you please explain it in more detail?
Single writer and GetSnapshotData() already exclusively hold
ProcArrayLock at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.
GetSnapshotData() only locks ProcArrayLock in shared mode.
The problem is that we don't want to add a lot of work to
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
for that moment and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than on replica snapshot building.
During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.
Greetings,
Andres Freund
Hello, Andres.
Thanks for the feedback again.
The problem is that we don't want to add a lot of work to
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
for that moment and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than on replica snapshot building.
Yes, but my patch adds almost the smallest possible amount of work to
KnownAssignedXidsAdd/Remove - a single write to the array by index.
It differs from the first version in this thread, which is based on linked lists.
The "next valid offset" is just an "optimistic optimization" - it means
"you can safely skip KnownAssignedXidsNext[i] entries while looking for the
next valid one".
But KnownAssignedXidsNext is not updated by Add/Remove.
During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.
AFAIU the patch does exactly that.
When the first snapshot is built, the offsets to the next valid entries are
set, so a dense version is created on demand.
And this version is reused (even partially, if something was removed) when
the next snapshot is built.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
We could try to use some other structure (for example, a linked hash
map), but the additional cost (memory management, links, hash
calculation) would probably reduce performance significantly.
And it is a much harder step to take.
So, I think "next valid offset" optimization is a good trade-off for now:
* KnownAssignedXidsAdd/Remove are almost not affected in their complexity
* KnownAssignedXidsGetAndSetXmin is eliminated from the CPU top on
typical read scenario - so, more TPS, less ProcArrayLock contention
* it complements GetSnapshotData() scalability - now on standby
* changes are small
Thanks,
Michail.
On Tue, Aug 10, 2021 at 12:45:17AM +0300, Michail Nikolaev wrote:
Thanks for the feedback again.
From what I can see, there has been some feedback from Andres here,
and the thread has been idle for six weeks now, so I have marked this patch
as RwF in the CF app.
--
Michael
Hello, Andres.
Could you please clarify how best to deal with this situation?
Judging by your previous message, I think there was some
misunderstanding regarding the latest patch version (but I am not
sure), because, as far as I understand, the provided optimization (a lazily
calculated optional offset to the next valid xid) fits your wishes. It was
described in more detail in my previous message.
And now it is not clear to me how to move forward :)
One option is to try to find some better data structure (like
some tricky linked hash map), but that would be a huge change
without any confidence of getting a more effective version (because
the provided changes already make the structure pretty effective).
The other option I see is to use the optimization from the latest patch
version and get a significant TPS increase (5-7%) for the typical standby
read scenario. The patch is small and does not affect other scenarios in a
negative way. I could probably run an additional set of
performance tests and provide some simulation to prove that the
pg_atomic_uint32-related code is correct (if required).
Or just leave the issue and hope someone else will try to fix it in
the future :)
Thanks a lot,
Michail.
Sorry for such a late reply. I've been thinking about possible approaches.
KnownAssignedXids over a hashtable was in fact implemented long ago and rejected [0].
On 3 Aug 2021, at 22:35, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
On 3 Aug 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements simple membership test idea. What kind of
redesign would you suggest? Proposed optimisation makes it close to optimal,
but needs eventual compression.
Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.
The current patch addresses another problem. In the presence of an old enough transaction, enumeration of KnownAssignedXids under the shared lock prevents adding new transactions under the exclusive lock, and recovery effectively pauses.
Binary searches can consume 10-15 cache misses, which is an unreasonable amount of memory waits. But that's a somewhat different problem.
Also, binsearch is not that expensive when we compress KnownAssignedXids often.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding/removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking about an inefficient KnownAssignedXidsRemovePreceding() in a hashtable. But this is probably not such a frequent operation.
Maybe use a tree? (AVL/RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
I've been prototyping a radix tree for a while.
Here every 4 xids are summarized by the minimum xid and the number of underlying xids. Of course, 4 is an arbitrary number; the summarization area must be of cacheline size.
┌───────┐
│ 1 / 9 │
├───────┴────┐
│ └────┐
│ └────┐
│ └────┐
▼ └───▶
┌───────────────────────────────┐
│ 1 / 3 | 5 / 0 | 9 / 3 | D / 3 │
├───────┬───────┬────────┬──────┴────┐
│ └─┐ └───┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └────┐
▼ └─┐ └──┐ └───┐ └────┐
┌───────────────▼────────────┴─▶────────────┴──▶───────────┴───▶
│ 1 | 2 | | 4 | | | | | 9 | | B | C | D | E | F | │
└──────────────────────────────────────────────────────────────┘
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an xid we touch a theoretical minimum of cachelines - I'd say 5-7 instead of the 10-15 of binsearch (in the case of millions of entries in KnownAssignedXids).
Enumerating running xids is not that difficult either: we would need to scan O(xip) memory, not the whole KnownAssignedXids array.
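
In code, one summary node might look roughly like this (illustrative only; a
real node would be packed to cacheline size):

#include <stdint.h>

typedef uint32_t TransactionId;

#define GROUP_SIZE 4                    /* 4 only to match the picture above */

/* Each upper-level node summarizes a fixed-size group of slots below it:
 * the minimum xid in the group (the search key) and how many of the
 * underlying slots are still valid (0 means the whole group can be skipped). */
typedef struct
{
    TransactionId   min_xid;
    uint8_t         nvalid;
} SummaryNode;
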
But the overall complexity of this approach does not seem good to me.
All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.
Currently we do not compress the array when:
if (nelements < 4 * PROCARRAY_MAXPROCS || // it's not that big yet, OR
    nelements < 2 * pArray->numKnownAssignedXids) // it contains less than half bloat
    return;
From my POV the arbitrary number 4 is just too high.
Summary: I think (the "KnownAssignedXidsNext" patch + compressing KnownAssignedXids more often) is better than a major KnownAssignedXids redesign.
Best regards, Andrey Borodin.
[0]: https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4
Hello, Andrey.
Thanks for your feedback.
The current patch addresses another problem. In the presence of an old enough transaction, enumeration of KnownAssignedXids under the shared lock prevents adding new transactions under the exclusive lock, and recovery effectively pauses.
Actually, I see two problems here (caused by the presence of old, long
transactions). The first one is lock contention, which causes recovery
pauses. The second one is just high CPU usage on the standby by
KnownAssignedXidsGetAndSetXmin.
All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.
I updated the patch a little. KnownAssignedXidsGetAndSetXmin now
causes fewer cache misses because some values are stored in variables
(registers). I think it is better to not lean on the compiler here
because of `volatile` args.
Also, I have added some comments.
Best regards,
Michail.
Attachments:
v2-0001-known-assignment-xid-next.patch (text/x-patch)
From 1d55c6fae8cc160eadd705da0d70d9e7fb5bc00f Mon Sep 17 00:00:00 2001
From: Michail Nikolaev <michail.nikolaev@gmail.com>
Date: Wed, 10 Nov 2021 00:02:18 +0300
Subject: [PATCH v2] known assignment xid next
---
src/backend/storage/ipc/procarray.c | 67 ++++++++++++++++++++++++-----
1 file changed, 57 insertions(+), 10 deletions(-)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 892f0f6799..422004ad31 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
*/
static TransactionId *KnownAssignedXids;
static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
static TransactionId latestObservedXid = InvalidTransactionId;
/*
@@ -405,6 +406,7 @@ void
CreateSharedProcArray(void)
{
bool found;
+ int i;
/* Create or attach to the ProcArray shared structure */
procArray = (ProcArrayStruct *)
@@ -446,6 +448,12 @@ CreateSharedProcArray(void)
ShmemInitStruct("KnownAssignedXidsValid",
mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
&found);
+ KnownAssignedXidsNext = (pg_atomic_uint32 *)
+ ShmemInitStruct("KnownAssignedXidsNext",
+ mul_size(sizeof(pg_atomic_uint32), TOTAL_MAX_CACHED_SUBXIDS),
+ &found);
+ for (i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+ pg_atomic_init_u32(&KnownAssignedXidsNext[i], 1);
}
}
@@ -4517,7 +4525,13 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
* XID entry itself. This preserves the property that the XID entries are
* sorted, so we can do binary searches easily. Periodically we compress
* out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain offsets
+ * in the KnownAssignedXidsNext[] array to skip known-to-be-invalid xids.
+ * This helps to skip the gaps and can significantly increase performance
+ * in the case of long transactions on the primary. KnownAssignedXidsNext[]
+ * is updated while taking a snapshot. In the general case, an entry of
+ * KnownAssignedXidsNext contains not the exact offset to the next valid
+ * xid but a value that tends towards that offset.
*
* The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
* are those with indexes tail <= i < head; items outside this subscript range
@@ -4544,7 +4558,10 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
* head/tail pointers. (We could dispense with the spinlock if we were to
* create suitable memory access barrier primitives and use those instead.)
* The spinlock must be taken to read or write the head/tail pointers unless
- * the caller holds ProcArrayLock exclusively.
+ * the caller holds ProcArrayLock exclusively. Access to KnownAssignedXidsNext
+ * is not specially protected by any lock because it is just a hint used to
+ * reduce the scan cost, but at least shared ProcArrayLock is held anyway -
+ * it is required to access KnownAssignedXids.
*
* Algorithmic analysis:
*
@@ -4555,7 +4572,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
* must happen)
* * Compressing the array is O(S) and requires exclusive lock
* * Removing an XID is O(logS) and requires exclusive lock
- * * Taking a snapshot is O(S) and requires shared lock
+ * Taking a snapshot is O(S) on the first call and O(N) afterwards; requires shared lock
* * Checking for an XID is O(logS) and requires shared lock
*
* In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4615,12 +4632,13 @@ KnownAssignedXidsCompress(bool force)
* re-aligning data to 0th element.
*/
compress_index = 0;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
KnownAssignedXids[compress_index] = KnownAssignedXids[i];
KnownAssignedXidsValid[compress_index] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[compress_index], 1);
compress_index++;
}
}
@@ -4723,6 +4741,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
{
KnownAssignedXids[head] = next_xid;
KnownAssignedXidsValid[head] = true;
+ pg_atomic_write_u32(&KnownAssignedXidsNext[head], 1);
TransactionIdAdvance(next_xid);
head++;
}
@@ -4938,7 +4957,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
tail = pArray->tailKnownAssignedXids;
head = pArray->headKnownAssignedXids;
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
@@ -4961,7 +4980,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
/*
* Advance the tail pointer if we've marked the tail item invalid.
*/
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
break;
@@ -5011,7 +5030,10 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
int count = 0;
int head,
tail;
- int i;
+ int i,
+ prev;
+ uint32 offset = 0,
+ prevOffset;
/*
* Fetch head just once, since it may change while we loop. We can stop
@@ -5025,9 +5047,16 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
SpinLockAcquire(&procArray->known_assigned_xids_lck);
tail = procArray->tailKnownAssignedXids;
head = procArray->headKnownAssignedXids;
+ /**
+ * Store previous valid xid found. Also, store its offset to avoid
+ * additional read later.
+ */
+ prev = tail;
+ prevOffset = pg_atomic_read_u32(&KnownAssignedXidsNext[prev]);
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head;
+ i += (offset = pg_atomic_read_u32(&KnownAssignedXidsNext[i])))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -5050,6 +5079,24 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
TransactionIdFollowsOrEquals(knownXid, xmax))
break;
+ if (prev != i)
+ {
+ uint32 n = i - prev;
+ /**
+ * Do not touch the cache if value unchanged. This way we
+ * could avoid additional cache miss.
+ */
+ if (n != prevOffset)
+ pg_atomic_write_u32(&KnownAssignedXidsNext[prev], n);
+ /**
+ * Remember this xid as previous valid. Also, manually store
+ * prevOffset from current fetched value to avoid additional
+ * atomic read.
+ */
+ prev = i;
+ prevOffset = offset;
+ }
+
/* Add knownXid into output array */
xarray[count++] = knownXid;
}
@@ -5077,7 +5124,7 @@ KnownAssignedXidsGetOldestXmin(void)
head = procArray->headKnownAssignedXids;
SpinLockRelease(&procArray->known_assigned_xids_lck);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
/* Skip any gaps in the array */
if (KnownAssignedXidsValid[i])
@@ -5112,7 +5159,7 @@ KnownAssignedXidsDisplay(int trace_level)
initStringInfo(&buf);
- for (i = tail; i < head; i++)
+ for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
{
if (KnownAssignedXidsValid[i])
{
--
2.25.1
Hello, everyone.
I made a performance test to make sure the patch solves the real issue
without a performance regression.
The tests were run on 3 VMs - one for the primary, another for the standby,
and the last one for pgbench. Each is an Azure Standard_D16ads_v5 - 16 vCPU,
64 GiB RAM, fast SSD.
5000 is used as the number of connections (it is the max number of
connections for AWS - LEAST({DBInstanceClassMemory/9531392}, 5000)).
Setup:
primary:
max_connections=5000
listen_addresses='*'
fsync=off
standby:
primary_conninfo = 'user=postgres host=10.0.0.4 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
hot_standby_feedback = on
max_connections=5000
listen_addresses='*'
fsync=off
The test was run the following way:
# restart both standby and primary
# init fresh DB
./pgbench -h 10.0.0.4 -i -s 10 -U postgres -d postgres
# warm up primary for 10 seconds
./pgbench -h 10.0.0.4 -b simple-update -j 8 -c 16 -P 1 -T 10 -U
postgres postgres
# warm up standby for 10 seconds
./pgbench -h 10.0.0.5 -b select-only -j 8 -c 16 -n -P 1 -T 10 -U
postgres postgres
# then, run at the same(!) time (in parallel):
# simple-update on primary
./pgbench -h 10.0.0.4 -b simple-update -j 8 -c 16 -P 1 -T 180 -U
postgres postgres
# simple-select on standby
./pgbench -h 10.0.0.5 -b select-only -j 8 -c 16 -n -P 1 -T 180 -U
postgres postgres
# then, 60 seconds after the test starts - start a long transaction
on the master
./psql -h 10.0.0.4 -c "BEGIN; select txid_current();SELECT
pg_sleep(5);COMMIT;" -U postgres postgres
I made 3 runs for both the patched and vanilla versions (current
master branch). One run of the patched version was retried because of
a significant difference in TPS (these are vCPUs on a VM with neighbors,
so probably some isolation issue).
The result on the primary is about 23k-25k TPS for both versions.
So, the graphs show a significant reduction of TPS on the secondary
while the long transaction is active (about 10%).
The patched version solves the issue without any noticeable regression
in the case of short-only transactions.
Also, transactions could be much shorter to reduce CPU - a few seconds
is enough.
Also, here is a `perf diff` between recordings with and without the long
transaction.
Vanilla (+ 10.26% of KnownAssignedXidsGetAndSetXmin):
0.22% +10.26% postgres [.]
KnownAssignedXidsGetAndSetXmin
3.39% +0.68% [kernel.kallsyms] [k]
_raw_spin_unlock_irqrestore
2.66% -0.61% libc-2.31.so [.] 0x0000000000045dc1
3.77% -0.50% postgres [.] base_yyparse
3.43% -0.45% [kernel.kallsyms] [k] finish_task_switch
0.41% +0.36% postgres [.] pg_checksum_page
0.61% +0.31% [kernel.kallsyms] [k] copy_user_generic_string
Patched (+ 0.22%):
2.26% -0.40% [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.78% +0.39% [kernel.kallsyms] [k] copy_user_generic_string
0.22% +0.26% postgres [.] KnownAssignedXidsGetAndSetXmin
0.23% +0.20% postgres [.] ScanKeywordLookup
3.77% +0.19% postgres [.] base_yyparse
0.64% +0.19% postgres [.] pg_checksum_page
3.63% -0.18% [kernel.kallsyms] [k] finish_task_switch
If someone knows of any additional performance tests that need to be done,
please share.
Best regards,
Michail.
Attachments:
runs.png (image/png)