Slow standby snapshot
Hi,
I recently ran into a problem in one of our production PostgreSQL clusters.
I noticed lock contention on ProcArrayLock on a standby, which causes
WAL replay lag to grow.
To reproduce this, you can do the following:
1) set max_connections to a big number, like 100000
2) begin a transaction on primary
3) start pgbench workload on primary and on standby
After a while you will see KnownAssignedXidsGetAndSetXmin in
perf top consuming about 75% of CPU.
%%
PerfTop: 1060 irqs/sec kernel: 0.0% exact: 0.0% [4000Hz cycles:u],
(target_pid: 273361)
-------------------------------------------------------------------------------
73.92% postgres [.] KnownAssignedXidsGetAndSetXmin
1.40% postgres [.] base_yyparse
0.96% postgres [.] LWLockAttemptLock
0.84% postgres [.] hash_search_with_hash_value
0.84% postgres [.] AtEOXact_GUC
0.72% postgres [.] ResetAllOptions
0.70% postgres [.] AllocSetAlloc
0.60% postgres [.] _bt_compare
0.55% postgres [.] core_yylex
0.42% libc-2.27.so [.] __strlen_avx2
0.23% postgres [.] LWLockRelease
0.19% postgres [.] MemoryContextAllocZeroAligned
0.18% postgres [.] expression_tree_walker.part.3
0.18% libc-2.27.so [.] __memmove_avx_unaligned_erms
0.17% postgres [.] PostgresMain
0.17% postgres [.] palloc
0.17% libc-2.27.so [.] _int_malloc
0.17% postgres [.] set_config_option
0.17% postgres [.] ScanKeywordLookup
0.16% postgres [.] _bt_checkpage
%%
We tried to fix this by using a BitMapSet instead of the boolean array
KnownAssignedXidsValid, but this did not help much.
Using a doubly linked list instead helps more: we got +1000 TPS
on the pgbench workload with patched PostgreSQL. The general idea of this patch
is that, instead of remembering which elements in KnownAssignedXids are
valid, we maintain a doubly linked list of them. The solution works
in exactly the same way, except that taking a snapshot on the replica is
now O(running transactions) instead of O(head - tail), which is significantly
faster under some workloads. The patch reduces the CPU usage of
KnownAssignedXidsGetAndSetXmin to ~48% from ~74%, but does not eliminate
it from perf top.
The problem is easier to reproduce on PG13, since PG14 has some snapshot
optimizations.
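To illustrate the idea, here is a toy model in C (not the patch itself - the KaxList type and all kax_* names are invented for this sketch): removal becomes an O(1) unlink, and the snapshot scan visits only the valid entries instead of every slot between tail and head.

```c
#include <assert.h>

#define KAX_SIZE 16
#define KAX_INVALID (-1)

/* Toy model: instead of a boolean "valid" array, keep prev/next links
 * so a snapshot scan walks only valid entries. */
typedef struct
{
    int     xids[KAX_SIZE];     /* stand-in for the TransactionId array */
    int     next[KAX_SIZE];     /* next valid slot, KAX_INVALID at end */
    int     prev[KAX_SIZE];     /* previous valid slot */
    int     head_valid;         /* first valid slot, KAX_INVALID if none */
} KaxList;

static void
kax_init(KaxList *l)
{
    l->head_valid = KAX_INVALID;
}

static void
kax_add(KaxList *l, int slot, int xid)
{
    /* prepend for simplicity; the real array keeps xids in order */
    l->xids[slot] = xid;
    l->prev[slot] = KAX_INVALID;
    l->next[slot] = l->head_valid;
    if (l->head_valid != KAX_INVALID)
        l->prev[l->head_valid] = slot;
    l->head_valid = slot;
}

static void
kax_remove(KaxList *l, int slot)
{
    /* O(1) unlink instead of just clearing a "valid" flag */
    if (l->prev[slot] != KAX_INVALID)
        l->next[l->prev[slot]] = l->next[slot];
    else
        l->head_valid = l->next[slot];
    if (l->next[slot] != KAX_INVALID)
        l->prev[l->next[slot]] = l->prev[slot];
}

/* Snapshot scan: O(number of valid xids), not O(head - tail). */
static int
kax_collect(const KaxList *l, int *out)
{
    int     n = 0;

    for (int i = l->head_valid; i != KAX_INVALID; i = l->next[i])
        out[n++] = l->xids[i];
    return n;
}
```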
Thanks!
Best regards, reshke
Sorry, I forgot to attach the patch to the letter.
On Thu, May 20, 2021 at 13:52 Кирилл Решке <reshkekirill@gmail.com> wrote:
Attachments:
UseDoublyLinkedListInKnowAssingedXods.patch (text/x-patch, +113 -82)
Hello.
I recently ran into a problem in one of our production postgresql cluster.
I had noticed lock contention on procarray lock on standby, which causes WAL
replay lag growth.
Yes, I saw the same issue on my production cluster.
1) set max_connections to big number, like 100000
I made the tests with a more realistic value - 5000. It is a valid value
for Amazon RDS, for example (the default there is
LEAST({DBInstanceClassMemory/9531392}, 5000)).
The test looks like this:
pgbench -i -s 10 -U postgres -d postgres
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
long transaction on primary - begin;select txid_current();
perf top -p <pid of some standby>
So, on Postgres 14 (master), the non-patched version looks like this:
5.13% postgres [.] KnownAssignedXidsGetAndSetXmin
4.61% postgres [.] pg_checksum_block
2.54% postgres [.] AllocSetAlloc
2.44% postgres [.] base_yyparse
It is too much to spend 5-6% of CPU running through an array :) I think
it should be fixed for both the 13 and 14 versions.
The patched version looks like this (I was unable to notice
KnownAssignedXidsGetAndSetXmin):
3.08% postgres [.] pg_checksum_block
2.89% postgres [.] AllocSetAlloc
2.66% postgres [.] base_yyparse
2.00% postgres [.] MemoryContextAllocZeroAligned
On Postgres 13, the non-patched version looks even worse (it definitely needs
to be fixed, in my opinion):
26.44% postgres [.] KnownAssignedXidsGetAndSetXmin
2.17% postgres [.] base_yyparse
2.01% postgres [.] AllocSetAlloc
1.55% postgres [.] MemoryContextAllocZeroAligned
But your patch does not apply to REL_13_STABLE. Could you please
provide two versions?
Also, there are warnings while building with the patch:
procarray.c:4595:9: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
4595 | int prv = -1;
| ^~~
procarray.c: In function ‘KnownAssignedXidsGetOldestXmin’:
procarray.c:5056:5: warning: variable ‘tail’ set but not used
[-Wunused-but-set-variable]
5056 | tail;
| ^~~~
procarray.c:5067:38: warning: ‘i’ is used uninitialized in
this function [-Wuninitialized]
5067 | i = KnownAssignedXidsValidDLL[i].nxt;
Some of them are clear errors, so please recheck the code.
Also, maybe it is better to reduce the invasiveness by using a
simpler approach. For example, use the first bit to mark an xid as valid
and the remaining 7 bits (128 values) as an optimistic offset to the next
valid xid (jumping by up to 127 slots at a time).
What do you think?
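A toy sketch of that encoding (illustrative only - the kax_* names and the exact bit layout are invented here, not taken from any patch): one byte per slot, top bit marks "valid", low 7 bits hint how far the next valid slot is.

```c
#include <assert.h>
#include <stdint.h>

#define KAX_VALID_BIT 0x80
#define KAX_HINT_MASK 0x7F

static inline int
kax_is_valid(uint8_t b)
{
    return (b & KAX_VALID_BIT) != 0;
}

static inline int
kax_hint(uint8_t b)
{
    int     h = b & KAX_HINT_MASK;

    /* a zero (unset) hint must still advance at least one slot */
    return h > 0 ? h : 1;
}

/* Find the next valid slot at or after 'start'; -1 if none.
 * Invalid runs are skipped by following the stored hints. */
static int
kax_next_valid(const uint8_t *marks, int len, int start)
{
    int     i = start;

    while (i < len && !kax_is_valid(marks[i]))
        i += kax_hint(marks[i]);
    return i < len ? i : -1;
}
```

In the worst case (all hints unset) this degrades to the current linear scan; with maintained hints each gap of up to 127 slots costs one read.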
Also, it is a good idea to register the patch in the commitfest app
(https://commitfest.postgresql.org/).
Thanks,
Michail.
Hello, Kirill.
Also, maybe it is better to reduce the invasivity by using a more
simple approach. For example, use the first bit to mark xid as valid
and the last 7 bit (128 values) as an optimistic offset to the next
valid xid (jump by 127 steps in the worse scenario).
What do you think?
I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.
I have looked again at your patch, but it seems like it has a lot of
issues at the moment:
* an error in KnownAssignedXidsGetOldestXmin: `i` is uninitialized, and the logic is wrong
* an error in the compression function
(```KnownAssignedXidsValidDLL[compress_index].prv = prv;``` - `prv` is
never updated)
* probably other errors?
* compilation warnings
* the logic around `KAX_DLL_ENTRY_INVALID` looks a little complex
* variable/method placement is bad (see `KAX_E_INVALID` and others)
* the documentation about KnownAssignedXidsValid needs updating, see ```To
keep individual deletions cheap, we need to allow gaps in the array```
in procarray.c
* the formatting is broken
Do you have plans to update it? If not, I could try to rewrite it.
Also, what about adding the patch to the commitfest?
Thanks,
Michail.
Hello.
I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.
That was a mistake on my part. I have repeated the approach and got good
results with a small and non-invasive patch.
The main idea is a simple optimistic optimization - store the offset to the
next valid entry. So, in most cases, we can just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(multi-second) transactions, but it is almost undetectable (because of
CPU caches).
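A simplified model of the scan, as I understand it (this is an illustration, not the patch code - the kax_scan name and types are invented): the snapshot build follows the hints across gaps and refreshes them as it goes, so the dense "shortcut" view is built and repaired on demand.

```c
#include <assert.h>
#include <stdint.h>

/* Walk [tail, head), collecting valid xids into 'out'.
 * next_hint[i] optimistically stores the distance from slot i to the
 * next valid slot (0 = unknown, advance by 1). While scanning, the
 * hint at the start of each gap is updated so the next scan jumps the
 * whole gap in one step. */
static int
kax_scan(const int *xids, const char *valid, uint8_t *next_hint,
         int tail, int head, int *out)
{
    int     n = 0;
    int     i = tail;

    while (i < head)
    {
        int     start = i;

        /* follow the optimistic hints across the gap */
        while (i < head && !valid[i])
            i += next_hint[i] > 0 ? next_hint[i] : 1;

        /* remember the jump for the next snapshot build */
        if (i > start && i - start <= 127)
            next_hint[start] = (uint8_t) (i - start);

        if (i < head)
            out[n++] = xids[i++];
    }
    return n;
}
```

Add/Remove never touch next_hint in this model; a removal only makes a hint land on an invalid slot, from which the scan simply keeps walking, so stale hints are safe.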
* TEST
The following testing setup was used:
max_connections=5000 (taken from a real RDS installation)
pgbench -i -s 10 -U postgres -d postgres
# emulate typical OLTP load
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
# emulate typical cache load on the replica
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
# emulate some typical long transactions up to 60 seconds on primary
echo "\set t random(0, 60)
BEGIN;
select txid_current();
select pg_sleep(:t);
COMMIT;" > slow_transactions.bench
pgbench -f /home/nkey/pg/slow_transactions.bench -p 5432 -j 1 -c 10 -n
-P 1 -T 18000 -U postgres postgres
* RESULTS
*REL_13_STABLE* - 23.02% vs 0.76%
non-patched:
23.02% postgres [.] KnownAssignedXidsGetAndSetXmin
2.56% postgres [.] base_yyparse
2.15% postgres [.] AllocSetAlloc
1.68% postgres [.] MemoryContextAllocZeroAligned
1.51% postgres [.] hash_search_with_hash_value
1.26% postgres [.] SearchCatCacheInternal
1.03% postgres [.] hash_bytes
0.92% postgres [.] pg_checksum_block
0.89% postgres [.] expression_tree_walker
0.81% postgres [.] core_yylex
0.69% postgres [.] palloc
0.68% [kernel] [k] do_syscall_64
0.59% postgres [.] _bt_compare
0.54% postgres [.] new_list
patched:
3.09% postgres [.] base_yyparse
3.00% postgres [.] AllocSetAlloc
2.01% postgres [.] MemoryContextAllocZeroAligned
1.89% postgres [.] SearchCatCacheInternal
1.80% postgres [.] hash_search_with_hash_value
1.27% postgres [.] expression_tree_walker
1.27% postgres [.] pg_checksum_block
1.18% postgres [.] hash_bytes
1.10% postgres [.] core_yylex
0.96% [kernel] [k] do_syscall_64
0.86% postgres [.] palloc
0.84% postgres [.] _bt_compare
0.79% postgres [.] new_list
0.76% postgres [.] KnownAssignedXidsGetAndSetXmin
*MASTER* - 6.16% vs ~0%
(includes snapshot scalability optimization by Andres Freund (1))
non-patched:
6.16% postgres [.] KnownAssignedXidsGetAndSetXmin
3.05% postgres [.] AllocSetAlloc
2.59% postgres [.] base_yyparse
1.95% postgres [.] hash_search_with_hash_value
1.87% postgres [.] MemoryContextAllocZeroAligned
1.85% postgres [.] SearchCatCacheInternal
1.27% postgres [.] hash_bytes
1.16% postgres [.] expression_tree_walker
1.06% postgres [.] core_yylex
0.94% [kernel] [k] do_syscall_64
patched:
3.35% postgres [.] base_yyparse
2.84% postgres [.] AllocSetAlloc
1.89% postgres [.] hash_search_with_hash_value
1.82% postgres [.] MemoryContextAllocZeroAligned
1.79% postgres [.] SearchCatCacheInternal
1.49% postgres [.] pg_checksum_block
1.26% postgres [.] hash_bytes
1.26% postgres [.] expression_tree_walker
1.08% postgres [.] core_yylex
1.04% [kernel] [k] do_syscall_64
0.81% postgres [.] palloc
It looks like it is possible to get a significant TPS increase on a very
typical standby workload.
Currently, I have no environment to measure TPS accurately. Could you
please try it on yours?
I have attached two versions of the patch - for master and REL_13_STABLE.
Also, I am going to add the patch to the commitfest (2).
Thanks,
Michail.
(1): https://commitfest.postgresql.org/29/2500/
(2): https://commitfest.postgresql.org/34/3271/
Hi,
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
Greetings,
Andres Freund
Hi,
On Aug 3, 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements a simple membership test. What kind of redesign would you suggest? The proposed optimisation makes it close to optimal, but it needs eventual compression.
Maybe use a hashtable of running transactions? It would be slightly faster when adding/removing single transactions, but much worse when doing KnownAssignedXidsRemove().
Maybe use a tree (AVL/RB or something like that)? It would be slightly better, because it does not need eventual compression like the existing array.
Best regards, Andrey Borodin.
Hi,
On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
On Aug 3, 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements simple membership test idea. What kind of
redesign would you suggest? Proposed optimisation makes it close to optimal,
but needs eventual compression.
Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding\removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
Maybe use a tree? (AVL\RB or something like that) It will be slightly better, because it does not need eventual compression like the existing array.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
Greetings,
Andres Freund
Hello, Andres.
Thanks for your feedback.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding\removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
It actually was a hashtable back in 2010. It was replaced by Simon Riggs
in commit 2871b4618af1acc85665eec0912c48f8341504c4.
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent.
It is still about 5-7% of CPU for a typical workload - a considerable
amount, in my opinion. And a lot of systems still run on 12 and 13.
The proposed approach eliminates KnownAssignedXidsGetAndSetXmin from
the top with a small changeset.
But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
Could you please explain it in more detail?
There is a single writer, and GetSnapshotData() already holds
ProcArrayLock exclusively at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.
BTW, a tiny thing we could also fix now is (comment from the code):
(We could dispense with the spinlock if we were to
* create suitable memory access barrier primitives and use those instead.)
* The spinlock must be taken to read or write the head/tail pointers unless
* the caller holds ProcArrayLock exclusively.
Thanks,
Michail.
Hi,
On 2021-08-03 22:23:58 +0300, Michail Nikolaev wrote:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent.
It is still about 5-7% of CPU for a typical workload, a considerable
amount for me.
I'm not saying we shouldn't optimize things, just that it's less pressing. And
the price we're willing to pay for the optimization may have changed.
And a lot of systems still work on 12 and 13.
I don't see us backporting performance improvements around this to 12 and 13,
so I don't think that matters much... We've done that a few times, but usually
when there's some accidentally quadratic behaviour or such.
But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
Could you please explain it in more detail?
Single writer and GetSnapshotData() already exclusively hold
ProcArrayLock at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.
GetSnapshotData() only locks ProcArrayLock in shared mode.
The problem is that we don't want to add a lot of work to
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
at that moment, and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than on replica snapshot building.
During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.
Greetings,
Andres Freund
Hello, Andres.
Thanks for the feedback again.
The problem is that we don't want to add a lot of work
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
for that moment and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than on replica snapshot building.
Yes, but my patch adds almost the smallest possible amount of work to
KnownAssignedXidsAdd/Remove - a single write to the array by index.
It differs from the first version in this thread, which was based on linked lists.
The "next valid offset" is just an "optimistic optimization" - it means
"you can safely skip to KnownAssignedXidsNext[i] while searching for the
next valid entry".
But KnownAssignedXidsNext is not updated by Add/Remove.
During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.
AFAIU the patch does exactly that.
On the first snapshot build, offsets to the next valid entry are
set. So, a dense version is created on demand.
And this version is reused (even partially, if something was removed) on
the next snapshot build.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
We could try to use some other structure (for example, a linked hash
map), but the additional cost (memory management, links, hash
calculation) would probably reduce performance significantly.
And it is a much harder step to take.
So, I think the "next valid offset" optimization is a good trade-off for now:
* KnownAssignedXidsAdd/Remove are almost unaffected in their complexity
* KnownAssignedXidsGetAndSetXmin is eliminated from the CPU top in the
typical read scenario - so, more TPS and less ProcArrayLock contention
* it complements the GetSnapshotData() scalability work - now on standby
* the changes are small
Thanks,
Michail.
On Tue, Aug 10, 2021 at 12:45:17AM +0300, Michail Nikolaev wrote:
Thanks for the feedback again.
From what I can see, there has been some feedback from Andres here,
and the thread is idle for six weeks now, so I have marked this patch
as RwF in the CF app.
--
Michael
Hello, Andres.
Could you please clarify how best to deal with this situation?
Judging by your previous letter, I think there was some
misunderstanding regarding the latest patch version (but I am not
sure). Because, as far as I understand, the provided optimization (a lazily
calculated optional offset to the next valid xid) fits your
wishes. It was described in the previous letter in more detail.
And now it is not clear to me how to move forward :)
One option is to try to find some better data structure (like
some tricky linked hash map), but that means huge changes
without any confidence of getting a more effective version (because
the provided changes already make the structure pretty effective).
Another option I see is to use the optimization from the latest patch
version and get a significant TPS increase (5-7%) for the typical standby
read scenario. The patch is small and does not affect other scenarios in a
negative way. I could probably run an additional set of
performance tests and provide some simulation to prove that the
pg_atomic_uint32-related code is correct (if required).
Or we just leave the issue and hope someone else will try to fix it in
the future :)
Thanks a lot,
Michail.
Sorry for the late reply. I've been thinking about possible approaches.
KnownAssignedXids over a hashtable was in fact implemented long ago and rejected [0].
On Aug 3, 2021, at 22:35, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
On Aug 3, 2021, at 03:01, Andres Freund <andres@anarazel.de> wrote:
On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well. I think this needs a bigger data structure redesign.
KnownAssignedXids implements simple membership test idea. What kind of
redesign would you suggest? Proposed optimisation makes it close to optimal,
but needs eventual compression.
Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.
The current patch addresses another problem. In the presence of an old enough transaction, enumeration of KnownAssignedXids under a shared lock prevents adding new transactions under an exclusive lock. And recovery effectively pauses.
Binary searches can incur 10-15 cache misses, which is an unreasonable amount of memory waits. But that's a somewhat different problem.
Also, binsearch is not that expensive if we compress KnownAssignedXids often.
Maybe use a hashtable of running transactions? It will be slightly faster
when adding\removing single transactions. But much worse when doing
KnownAssignedXidsRemove().
Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking about an inefficient KnownAssignedXidsRemovePreceding() in a hashtable. But, probably, this is not such a frequent operation.
Maybe use a tree? (AVL\RB or something like that) It will be slightly better, because it does not need eventual compression like exiting array.
I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:
I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
I've been prototyping a radix tree for a while.
Here, every 4 xids are summarized by the minimum xid and the number of underlying xids. Of course, 4 is an arbitrary number; the summarization area must be of cacheline size.
┌───────┐
│ 1 / 9 │
├───────┴────┐
│ └────┐
│ └────┐
│ └────┐
▼ └───▶
┌───────────────────────────────┐
│ 1 / 3 | 5 / 0 | 9 / 3 | D / 3 │
├───────┬───────┬────────┬──────┴────┐
│ └─┐ └───┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └─────┐
│ └─┐ └──┐ └────┐ └────┐
▼ └─┐ └──┐ └───┐ └────┐
┌───────────────▼────────────┴─▶────────────┴──▶───────────┴───▶
│ 1 | 2 | | 4 | | | | | 9 | | B | C | D | E | F | │
└──────────────────────────────────────────────────────────────┘
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an xid, we touch a theoretical minimum of cachelines. I'd say 5-7 instead of the 10-15 of binsearch (in the case of millions of entries in KnownAssignedXids).
Enumerating running xids is not that difficult either: we will need to scan O(xip) memory, not the whole KnownAssignedXids array.
But the overall complexity of this approach does not seem good to me.
All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.
Currently we do not compress the array if:
if (nelements < 4 * PROCARRAY_MAXPROCS ||         /* it's not that big yet, OR */
    nelements < 2 * pArray->numKnownAssignedXids) /* it contains less than half bloat */
    return;
From my POV, the arbitrary number 4 is just too high.
Summary: I think the "KnownAssignedXidsNext" patch plus compressing KnownAssignedXids more often is better than a major KnownAssignedXids redesign.
Best regards, Andrey Borodin.
[0]: https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4
Hello, Andrey.
Thanks for your feedback.
The current patch addresses another problem. In the presence of an old enough transaction, enumeration of KnownAssignedXids under a shared lock prevents adding new transactions under an exclusive lock. And recovery effectively pauses.
Actually, I see two problems here (both caused by the presence of old, long
transactions). The first is lock contention, which causes recovery
pauses. The second is simply high CPU usage on the standby by
KnownAssignedXidsGetAndSetXmin.
All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real problem, and the problem of binary searches should be addressed by compressing KnownAssignedXids more often.
I have updated the patch a little. KnownAssignedXidsGetAndSetXmin now
causes fewer cache misses because some values are stored in local variables
(registers). I think it is better not to lean on the compiler here
because of the `volatile` args.
Also, I have added some comments.
Best regards,
Michail.
Attachments:
v2-0001-known-assignment-xid-next.patch (text/x-patch, +57 -11)
Hello, everyone.
I ran a performance test to make sure the patch solves a real issue
without a performance regression.
Tests were made on 3 VMs - one for the primary, another for the standby,
and the last one for pgbench. Each is an Azure Standard_D16ads_v5 - 16 vCPU,
64 GiB RAM, fast SSD.
5000 was used as the number of connections (it is the max number of
connections for AWS - LEAST({DBInstanceClassMemory/9531392}, 5000)).
Setup:
primary:
max_connections=5000
listen_addresses='*'
fsync=off
standby:
primary_conninfo = 'user=postgres host=10.0.0.4 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
hot_standby_feedback = on
max_connections=5000
listen_addresses='*'
fsync=off
The test was run the following way:
# restart both standby and primary
# init fresh DB
./pgbench -h 10.0.0.4 -i -s 10 -U postgres -d postgres
# warm up primary for 10 seconds
./pgbench -h 10.0.0.4 -b simple-update -j 8 -c 16 -P 1 -T 10 -U
postgres postgres
# warm up standby for 10 seconds
./pgbench -h 10.0.0.5 -b select-only -j 8 -c 16 -n -P 1 -T 10 -U
postgres postgres
# then, run at the same(!) time (in parallel):
# simple-update on primary
./pgbench -h 10.0.0.4 -b simple-update -j 8 -c 16 -P 1 -T 180 -U
postgres postgres
# simple-select on standby
./pgbench -h 10.0.0.5 -b select-only -j 8 -c 16 -n -P 1 -T 180 -U
postgres postgres
# then, 60 seconds after test start - start a long transaction
on the primary
./psql -h 10.0.0.4 -c "BEGIN; select txid_current();SELECT
pg_sleep(5);COMMIT;" -U postgres postgres
I made 3 runs for both the patched and vanilla versions (current
master branch). One run of the patched version was retried because of
a significant difference in TPS (it is a vCPU on a VM with neighbors,
so probably some isolation issue).
The result on the primary is about 23k-25k TPS for both versions.
The graphs show a significant reduction of TPS on the secondary
(about 10%) while the long transaction is active.
The patched version solves the issue without any noticeable regression
in the case of short-only transactions.
Also, the transactions could be much shorter and still cause the CPU
load - a few seconds is enough.
Also, here is a `perf diff` between recordings with and without the long
transaction.
Vanilla (+ 10.26% of KnownAssignedXidsGetAndSetXmin):
0.22% +10.26% postgres [.]
KnownAssignedXidsGetAndSetXmin
3.39% +0.68% [kernel.kallsyms] [k]
_raw_spin_unlock_irqrestore
2.66% -0.61% libc-2.31.so [.] 0x0000000000045dc1
3.77% -0.50% postgres [.] base_yyparse
3.43% -0.45% [kernel.kallsyms] [k] finish_task_switch
0.41% +0.36% postgres [.] pg_checksum_page
0.61% +0.31% [kernel.kallsyms] [k] copy_user_generic_string
Patched (+ 0.22%):
2.26% -0.40% [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.78% +0.39% [kernel.kallsyms] [k] copy_user_generic_string
0.22% +0.26% postgres [.] KnownAssignedXidsGetAndSetXmin
0.23% +0.20% postgres [.] ScanKeywordLookup
3.77% +0.19% postgres [.] base_yyparse
0.64% +0.19% postgres [.] pg_checksum_page
3.63% -0.18% [kernel.kallsyms] [k] finish_task_switch
If someone knows of any additional performance tests that should be done,
please share.
Best regards,
Michail.
On Wed, Nov 10, 2021 at 12:16 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
I updated the patch a little. KnownAssignedXidsGetAndSetXmin now
causes fewer cache misses because some values are stored in variables
(registers). I think it is better to not lean on the compiler here
because of `volatile` args.
Also, I have added some comments.
It looks like KnownAssignedXidsNext doesn't have to be
pg_atomic_uint32. I see it only gets read with pg_atomic_read_u32()
and written with pg_atomic_write_u32(). Existing code believes that
read/write of 32-bit values is atomic. So, you can use just uint32
instead of pg_atomic_uint32.
------
Regards,
Alexander Korotkov
Hello, Alexander.
Thanks for your review.
It looks like KnownAssignedXidsNext doesn't have to be
pg_atomic_uint32. I see it only gets read with pg_atomic_read_u32()
and written with pg_atomic_write_u32(). Existing code believes that
read/write of 32-bit values is atomic. So, you can use just uint32
instead of pg_atomic_uint32.
Fixed. It looks better now, yeah.
Also, I added an additional commit (not depending on the KnownAssignedXidsNext
feature) replacing the spinlock with a memory barrier. It goes back to a
comment by Simon Riggs and Tom Lane from 2010:
(We could dispense with the spinlock if we were to
create suitable memory access barrier primitives and use those instead.)
Now it is possible to avoid an additional spinlock on each
KnownAssignedXidsGetAndSetXmin. I have not measured the performance
impact of this particular change yet (and it is probably not easy to
reliably measure an impact of less than 0.5%), but I think it is worth
adding. We need to protect only the head pointer, because the tail is
updated only with ProcArrayLock held exclusively. BTW, should I provide a
separate patch for this?
So, now we have a pretty successful benchmark for the typical use case,
and some additional investigation has been done. So, I think I'll
re-add the patch to the commitfest app.
Thanks,
Michail
On Nov 21, 2021, at 23:58, Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
<v3-0001-memory-barrier-instead-of-spinlock.patch>
A write barrier must be issued after the write, not before.
Don't we need to issue a read barrier too?
Best regards, Andrey Borodin.
Hello, Andrey.
A write barrier must be issued after the write, not before.
Don't we need to issue a read barrier too?
The write barrier is issued after the changes to the KnownAssignedXidsNext
and KnownAssignedXidsValid arrays and before the update of
headKnownAssignedXids.
So, it seems to be correct. We make sure that once the CPU sees a change of
headKnownAssignedXids, the underlying arrays already contain all the
required data.
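In C11-atomics terms the contract looks roughly like this (an illustrative model only, not the actual procarray.c code - PostgreSQL uses pg_write_barrier()/pg_read_barrier() rather than C11 atomics, and the kax_* names are invented):

```c
#include <assert.h>
#include <stdatomic.h>

/* Single writer fills entries, then publishes the new head.
 * Readers load head first and may then safely read every slot below it. */
static int kax_entries[64];
static _Atomic int kax_head;

static void
kax_publish(int slot, int xid)
{
    kax_entries[slot] = xid;
    /* release ordering = the write barrier issued AFTER the array
     * writes and BEFORE the head update */
    atomic_store_explicit(&kax_head, slot + 1, memory_order_release);
}

static int
kax_read_all(int *out)
{
    /* acquire ordering = the read-barrier counterpart on the reader
     * side: everything written before the release-store to head is
     * visible after this load */
    int     head = atomic_load_explicit(&kax_head, memory_order_acquire);

    for (int i = 0; i < head; i++)
        out[i] = kax_entries[i];
    return head;
}
```

The release store answers the "after the write, not before" point, and the acquire load is where the matching read barrier lives on the snapshot-building side.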
Thanks,
Michail.