MultiXact\SLRU buffers configuration
Hi, hackers!
*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation. Both being effectively full table scans.
The database workload contains a lot of select for share\select for update queries. I've tried to construct synthetic world generator and could not achieve similar lock configuration: I see a lot of different locks in wait events, particularly a lot more MultiXactMemberLocks. But from my experiments with synthetic workload, contention of MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
*** Question 1 ***
Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile and run database as usual?
I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.
*** Question 2 ***
Probably, we could do GUCs for SLRU sizes? Are there any reasons not to do them configurable? I think multis, clog, subtransactions and others will benefit from bigger buffer. But, probably, too much of knobs can be confusing.
*** Question 3 ***
MultiXact offset lock is always taken as exclusive lock. It turns MultiXact Offset subsystem to single threaded. If someone have good idea how to make it more concurrency-friendly, I'm willing to put some efforts into this.
Probably, I could just add LWlocks for each offset buffer page. Is it something worth doing? Or are there any hidden cavers and difficulties?
Thanks!
Best regards, Andrey Borodin.
8 мая 2020 г., в 21:36, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation. Both being effectively full table scans.
The database workload contains a lot of select for share\select for update queries. I've tried to construct synthetic world generator and could not achieve similar lock configuration: I see a lot of different locks in wait events, particularly a lot more MultiXactMemberLocks. But from my experiments with synthetic workload, contention of MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.*** Question 1 ***
Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile and run database as usual?
I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.*** Question 2 ***
Probably, we could do GUCs for SLRU sizes? Are there any reasons not to do them configurable? I think multis, clog, subtransactions and others will benefit from bigger buffer. But, probably, too much of knobs can be confusing.*** Question 3 ***
MultiXact offset lock is always taken as exclusive lock. It turns MultiXact Offset subsystem to single threaded. If someone have good idea how to make it more concurrency-friendly, I'm willing to put some efforts into this.
Probably, I could just add LWlocks for each offset buffer page. Is it something worth doing? Or are there any hidden cavers and difficulties?
I've created benchmark[0]https://github.com/x4m/multixact_stress imitating MultiXact pressure on my laptop: 7 clients are concurrently running select "select * from table where primary_key = ANY ($1) for share" where $1 is array of identifiers so that each tuple in a table is locked by different set of XIDs. During this benchmark I observe contention of MultiXactControlLock in pg_stat_activity
пятница, 8 мая 2020 г. 15:08:37 (every 1s)
pid | wait_event | wait_event_type | state | query
-------+----------------------------+-----------------+--------+----------------------------------------------------
41344 | ClientRead | Client | idle | insert into t1 select generate_series(1,1000000,1)
41375 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41377 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41378 | | | active | select * from t1 where i = ANY ($1) for share
41379 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41381 | | | active | select * from t1 where i = ANY ($1) for share
41383 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41385 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
(8 rows)
Finally, the benchmark is measuring time to execute select for update 42 times.
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size
I've found out that:
1. When MultiXact working set does not fit into buffers - benchmark results grow very high. Yet, very big buffers slow down benchmark too. For this benchmark optimal SLRU size id 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. Lock optimisation increases performance by 5% on default SLRU sizes. Actually, benchmark does not explicitly read MultiXactId members, but when it replaces one with another - it have to read previous set. I understand that we can construct benchmark to demonstrate dominance of any algorithm and 5% of synthetic workload is not a very big number. But it just make sense to try to take shared lock for reading.
3. Manipulations with cache size do not affect benchmark anyhow. It's somewhat expected: benchmark is designed to defeat cache, either way OffsetControlLock would not be stressed.
For our workload, I think we will just increase numbers of SLRU sizes. But patchset may be useful for tuning and as a performance optimisation of MultiXact.
Also MultiXacts seems to be not very good fit into SLRU design. I think it would be better to use B-tree as a container. Or at least make MultiXact members extendable in-place (reserve some size when multixact is created).
When we want to extend number of locks for a tuple currently we will:
1. Iterate through all SLRU buffers for offsets to read current offset (with exclusive lock for offsets)
2. Iterate through all buffers for members to find current members (with exclusive lock for members)
3. Create new members array with +1 xid
4. Iterate through all cache members to find out maybe there are any such cache item as what we are going to create
5. iterate over 1 again for write
6. Iterate over 2 again for write
Obviously this does not scale well - we cannot increase SLRU sizes for too long.
Thanks! I'd be happy to hear any feedback.
Best regards, Andrey Borodin.
Attachments:
v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patchapplication/octet-stream; name=v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patch; x-unix-mode=0644Download+62-9
v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchapplication/octet-stream; name=v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch; x-unix-mode=0644Download+17-16
v1-0003-Make-MultiXact-local-cache-size-configurable.patchapplication/octet-stream; name=v1-0003-Make-MultiXact-local-cache-size-configurable.patch; x-unix-mode=0644Download+29-2
11 мая 2020 г., в 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
What am I missing?
Best regards, Andrey Borodin.
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
11 мая 2020 г., в 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache sizeI'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
But, don't we hold exclusive lock on the buffer through all the steps
above?
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
So transactions never see such incomplete mxids, I believe.
What am I missing?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
14 мая 2020 г., в 06:25, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
11 мая 2020 г., в 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache sizeI'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLockBut, don't we hold exclusive lock on the buffer through all the steps
above?
Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committed tuple delete, but standby sees it as alive.
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.So transactions never see such incomplete mxids, I believe.
I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is locking whole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention.
It looks like this:
0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#0 0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1 0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2 0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3 GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4 0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at ./build/../src/backend/access/heap/heapam.c:7039
#5 0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6 0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7 0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8 0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9 heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>, node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:164
Best regards, Andrey Borodin.
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLockBut, don't we hold exclusive lock on the buffer through all the steps
above?Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committed tuple delete, but standby sees it as alive.
Ah, right. I looked from GetNewMultiXactId. Actually
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
the creating mxact id. And GetMultiXactIdMembers is considering that
case.
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.So transactions never see such incomplete mxids, I believe.
I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is locking whole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention.
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.
If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
14 мая 2020 г., в 11:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLockBut, don't we hold exclusive lock on the buffer through all the steps
above?Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committed tuple delete, but standby sees it as alive.
Ah, right. I looked from GetNewMultiXactId. Actually
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
the creating mxact id. And GetMultiXactIdMembers is considering that
case.When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.So transactions never see such incomplete mxids, I believe.
I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is locking whole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention.GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.
We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
Best regards, Andrey Borodin.
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.
Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
mxid_wait_instead_of_sleep.patchtext/x-patch; charset=us-asciiDownload+27-2
15 мая 2020 г., в 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create temporary extension which calls MultiXact API and tests edge-cases like this 1000us wait.
This extension will also be also useful for me to assess impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.
Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
So instead of fix, we, probably, can just add a comment. If this reasoning is correct.The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.
Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.
Best regards, Andrey Borodin.
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
15 мая 2020 г., в 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create temporary extension which calls MultiXact API and tests edge-cases like this 1000us wait.
This extension will also be also useful for me to assess impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.
Happy to hear that, It would need to use timeout just in case, though.
Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
So instead of fix, we, probably, can just add a comment. If this reasoning is correct.The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.
Yeah, right. Otherwise assertion failure happens.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.
The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
cheers ./daniel
2 июля 2020 г., в 17:02, Daniel Gustafsson <daniel@yesql.se> написал(а):
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2
Best regards, Andrey Borodin.
Attachments:
v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchapplication/octet-stream; name=v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch; x-unix-mode=0644Download+17-16
v2-0002-Make-MultiXact-local-cache-size-configurable.patchapplication/octet-stream; name=v2-0002-Make-MultiXact-local-cache-size-configurable.patch; x-unix-mode=0644Download+29-2
v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patchapplication/octet-stream; name=v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch; x-unix-mode=0644Download+25-2
On 8 Jul 2020, at 09:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
2 июля 2020 г., в 17:02, Daniel Gustafsson <daniel@yesql.se> написал(а):
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.Thanks, Daniel! PFA V2
This version too has stopped applying according to the CFbot. I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.
cheers ./daniel
On 08.07.2020 10:03, Andrey M. Borodin wrote:
2 июля 2020 г., в 17:02, Daniel Gustafsson <daniel@yesql.se> написал(а):
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the sleep 1000u is reached.The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.Thanks, Daniel! PFA V2
Best regards, Andrey Borodin.
1) The first patch is sensible and harmless, so I think it is ready for
committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers
configurable, however, I doubt if it really solves the problem, or just
moves it to another place?
The previous patch you sent was based on some version that contained
changes for other slru buffers numbers: 'multixact_offsets_slru_buffers'
and 'multixact_members_slru_buffers'. Have you just forgot to attach
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The
changes are trivial.
2.1) I think that both min and max values for this parameter are too
extreme. Have you tested them?
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v3-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchtext/x-patch; charset=UTF-8; name=v3-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchDownload+17-16
v3-0002-Make-MultiXact-local-cache-size-configurable.patchtext/x-patch; charset=UTF-8; name=v3-0002-Make-MultiXact-local-cache-size-configurable.patchDownload+31-1
v3-0003-Add-conditional-variable-to-wait-for-next-MultXac.patchtext/x-patch; charset=UTF-8; name=v3-0003-Add-conditional-variable-to-wait-for-next-MultXac.patchDownload+25-2
Hi, Anastasia!
28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries, + 256, 2, INT_MAX / 2,2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
I greatly appreciate your review, sorry for so long delay. Thanks!
Best regards, Andrey Borodin.
On 28.09.2020 17:41, Andrey M. Borodin wrote:
Hi, Anastasia!
28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries, + 256, 2, INT_MAX / 2,2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
I would go with the values that we consider adequate for this setting.
As I see, there is no strict rule about it in guc.c and many variables
have large border values. Anyway, we need to explain it at least in the
documentation and code comments.
It seems that the default was conservative enough, so it can be also a
minimal value too. As for maximum, can you provide any benchmark
results? If we have a peak and a noticeable performance degradation
after that, we can use it to calculate the preferable maxvalue.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries, + 256, 2, INT_MAX / 2,2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
I greatly appreciate your review, sorry for so long delay. Thanks!
I took a look at this patchset.
The 1st and 3rd patches look good to me. I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode. I've evaded this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed. Also, no current wait
events use slashes, and I don't think we should introduce slashes
here. Even if we should, then we should also rename existing wait
events to be consistent with a new one. So, I've renamed the new wait
event to remove the slash.
Regarding the patch 2. I see the current documentation in the patch
doesn't explain to the user how to set the new parameter. I think we
should give users an idea what workloads need high values of
multixact_local_cache_entries parameter and what doesn't. Also, we
should explain the negative aspects of high values
multixact_local_cache_entries. Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.
I'd like to propose committing 1 and 3, but leave 2 for further review.
------
Regards,
Alexander Korotkov
Attachments:
v4-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offsets.patchapplication/octet-stream; name=v4-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offsets.patchDownload+50-20
v4-0002-Make-MultiXact-local-cache-size-configurable.patchapplication/octet-stream; name=v4-0002-Make-MultiXact-local-cache-size-configurable.patchDownload+31-2
v4-0003-Add-conditional-variable-to-wait-for-next-MultXact-o.patchapplication/octet-stream; name=v4-0003-Add-conditional-variable-to-wait-for-next-MultXact-o.patchDownload+30-3
26 окт. 2020 г., в 06:05, Alexander Korotkov <aekorotkov@gmail.com> написал(а):
Hi!
On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries, + 256, 2, INT_MAX / 2,2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
I greatly appreciate your review, sorry for so long delay. Thanks!
I took a look at this patchset.
The 1st and 3rd patches look good to me. I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode. I've evaded this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed. Also, no current wait
events use slashes, and I don't think we should introduce slashes
here. Even if we should, then we should also rename existing wait
events to be consistent with a new one. So, I've renamed the new wait
event to remove the slash.Regarding the patch 2. I see the current documentation in the patch
doesn't explain to the user how to set the new parameter. I think we
should give users an idea what workloads need high values of
multixact_local_cache_entries parameter and what doesn't. Also, we
should explain the negative aspects of high values
multixact_local_cache_entries. Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.I'd like to propose committing 1 and 3, but leave 2 for further review.
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.
I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Best regards, Andrey Borodin.
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Thank you. I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.
------
Regards,
Alexander Korotkov
Attachments:
v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchapplication/octet-stream; name=v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patchDownload+53-20
v5-0002-Make-MultiXact-local-cache-size-configurable.patchapplication/octet-stream; name=v5-0002-Make-MultiXact-local-cache-size-configurable.patchDownload+31-2
v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patchapplication/octet-stream; name=v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patchDownload+36-3
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Thank you. I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.
I get that patchset v5 doesn't pass the tests due to typo in assert.
The fixes version is attached.
------
Regards,
Alexander Korotkov