MultiXact/SLRU buffers configuration
Hi, hackers!
*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
The database workload contains a lot of SELECT FOR SHARE / SELECT FOR UPDATE queries. I've tried to construct a synthetic workload generator, but could not reproduce a similar lock distribution: I see a lot of different locks in wait events, in particular many more MultiXactMemberLocks. Still, in my experiments with the synthetic workload, MultiXactOffsetLock contention can be reduced by increasing NUM_MXACTOFFSET_BUFFERS from its default of 8 to bigger numbers.
*** Question 1 ***
Is it safe to increase the number of buffers of the MultiXact (or all) SLRUs, recompile, and run the database as usual?
I cannot experiment much with production, but I'm fairly sure that bigger buffers will solve the problem.
*** Question 2 ***
Could we perhaps add GUCs for the SLRU sizes? Are there any reasons not to make them configurable? I think multixacts, clog, subtransactions and others would benefit from bigger buffers. But, admittedly, too many knobs can be confusing.
*** Question 3 ***
The MultiXact offset lock is always taken as an exclusive lock, which makes the MultiXact offset subsystem effectively single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
Perhaps I could just add an LWLock for each offset buffer page. Is that worth doing? Or are there hidden caveats and difficulties?
Thanks!
Best regards, Andrey Borodin.
On 8 May 2020, at 21:36, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I've created a benchmark[0] (https://github.com/x4m/multixact_stress) imitating MultiXact pressure on my laptop: 7 clients concurrently run "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers chosen so that each tuple in the table is locked by a different set of XIDs. During this benchmark I observe contention on MultiXactOffsetControlLock in pg_stat_activity:
Friday, 8 May 2020 15:08:37 (every 1s)
pid | wait_event | wait_event_type | state | query
-------+----------------------------+-----------------+--------+----------------------------------------------------
41344 | ClientRead | Client | idle | insert into t1 select generate_series(1,1000000,1)
41375 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41377 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41378 | | | active | select * from t1 where i = ANY ($1) for share
41379 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41381 | | | active | select * from t1 where i = ANY ($1) for share
41383 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
41385 | MultiXactOffsetControlLock | LWLock | active | select * from t1 where i = ANY ($1) for share
(8 rows)
Finally, the benchmark measures the time it takes to execute the select for update 42 times.
I've gone ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduced locking level (shared instead of exclusive) on read of MultiXactId members
3. Configurable cache size
I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times grow very high. Yet very big buffers slow the benchmark down too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation improves performance by 5% at the default SLRU sizes. The benchmark does not explicitly read MultiXactId members, but when it replaces one multixact with another it has to read the previous member set. I understand that one can construct a benchmark to demonstrate the dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it simply makes sense to take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark at all. This is somewhat expected: the benchmark is designed to defeat the cache; otherwise OffsetControlLock would not be stressed.
For our workload, I think we will just increase the SLRU sizes. But the patchset may be useful for tuning and as a performance optimisation of MultiXact in general.
Also, MultiXacts do not seem to be a very good fit for the SLRU design. I think it would be better to use a B-tree as the container, or at least to make MultiXact members extendable in place (reserve some space when a multixact is created).
When we want to extend the set of lockers of a tuple, currently we:
1. Iterate through all SLRU buffers for offsets to read the current offset (holding the offsets lock exclusively)
2. Iterate through all buffers for members to find the current members (holding the members lock exclusively)
3. Create a new members array with one more XID
4. Iterate through all cache entries to check whether an item identical to the one we are about to create already exists
5. Repeat step 1, this time for the write
6. Repeat step 2, this time for the write
Obviously this does not scale well - we cannot keep increasing SLRU sizes forever.
Thanks! I'd be happy to hear any feedback.
Best regards, Andrey Borodin.
Attachments:
Attachment: v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patch
From 17ba50c29aafd2de81462e0aca70c8629076e52c Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 9 May 2020 16:42:07 +0500
Subject: [PATCH v1 1/3] Add GUCs to tune MultiXact SLRUs
---
doc/src/sgml/config.sgml | 31 ++++++++++++++++++++++++++
src/backend/access/transam/multixact.c | 8 +++----
src/backend/utils/init/globals.c | 3 +++
src/backend/utils/misc/guc.c | 22 ++++++++++++++++++
src/include/access/multixact.h | 4 ----
src/include/miscadmin.h | 2 ++
6 files changed, 62 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3aea1763b4..71aa5c4900 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1746,6 +1746,37 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-offsets-slru-buffers" xreflabel="multixact_offsets_slru_buffers">
+ <term><varname>multixact_offsets_slru_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_slru_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+ are used to store information about the offsets of multiple row lockers (caused by SELECT FOR UPDATE and others).
+ It defaults to 64 kilobytes (<literal>64KB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-slru-buffers" xreflabel="multixact_members_slru_buffers">
+ <term><varname>multixact_members_slru_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_slru_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of memory to be used for MultiXact members. MultiXact members
+ are used to store information about the XIDs of multiple row lockers. Typically <varname>multixact_members_slru_buffers</varname>
+ is twice <varname>multixact_offsets_slru_buffers</varname>.
+ It defaults to 128 kilobytes (<literal>128KB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..d2ac029b98 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1812,8 +1812,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_slru_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_slru_buffers, 0));
return size;
}
@@ -1829,11 +1829,11 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "multixact_offset", NUM_MXACTOFFSET_BUFFERS, 0,
+ "multixact_offset", multixact_offsets_slru_buffers, 0,
MultiXactOffsetControlLock, "pg_multixact/offsets",
LWTRANCHE_MXACTOFFSET_BUFFERS);
SimpleLruInit(MultiXactMemberCtl,
- "multixact_member", NUM_MXACTMEMBER_BUFFERS, 0,
+ "multixact_member", multixact_members_slru_buffers, 0,
MultiXactMemberControlLock, "pg_multixact/members",
LWTRANCHE_MXACTMEMBER_BUFFERS);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..aebfbdb67c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,3 +148,6 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_offsets_slru_buffers = 8;
+int multixact_members_slru_buffers = 16;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..b139e387f3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2270,6 +2270,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for MultiXact offsets SLRU."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_slru_buffers,
+ 8, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for MultiXact members SLRU."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_slru_buffers,
+ 16, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index af4aac08bd..848398bf1e 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -28,10 +28,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MXACTOFFSET_BUFFERS 8
-#define NUM_MXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..95f4633a88 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -161,6 +161,8 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_slru_buffers;
+extern PGDLLIMPORT int multixact_members_slru_buffers;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
--
2.24.2 (Apple Git-127)
Attachment: v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
From 0a0ebc6391c33ca77260656d886224c71fb8ad96 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 9 May 2020 21:19:18 +0500
Subject: [PATCH v1 2/3] Use shared lock in GetMultiXactIdMembers for offsets
and members
Previously, reading a multixact required exclusive control locks on the
offsets and members SLRUs. This could lead to contention.
In this commit we take advantage of SimpleLruReadPage_ReadOnly,
similar to CLOG usage.
---
src/backend/access/transam/clog.c | 2 +-
src/backend/access/transam/commit_ts.c | 2 +-
src/backend/access/transam/multixact.c | 12 ++++++------
src/backend/access/transam/slru.c | 8 +++++---
src/backend/access/transam/subtrans.c | 2 +-
src/backend/commands/async.c | 2 +-
src/backend/storage/lmgr/predicate.c | 2 +-
src/include/access/slru.h | 2 +-
8 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f8e7670f8d..2ae024608e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -643,7 +643,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid, false);
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 630df672cc..1efa269339 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -342,7 +342,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, false);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d2ac029b98..5387a3602d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1321,12 +1321,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1358,7 +1358,7 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, tmpMXact, true);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1382,7 +1382,7 @@ retry:
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(MultiXactMemberControlLock, LW_SHARED);
truelength = 0;
prev_pageno = -1;
@@ -1399,7 +1399,7 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno, multi, true);
prev_pageno = pageno;
}
@@ -2745,7 +2745,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index b2316af779..ae63ad4e6c 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -470,17 +470,19 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
+ * Control lock will be held at exit. If lock_held is true, the caller
+ * must already hold at least the shared control lock at entry.
* It is unspecified whether the lock will be shared or exclusive.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid, bool lock_held)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (!lock_held)
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 25d7d739cf..98700f26a7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -123,7 +123,7 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, false);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 0c9d20ebfc..a12447ecb5 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2005,7 +2005,7 @@ asyncQueueReadAllNotifications(void)
* of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(AsyncCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, false);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 654584b77a..2969992a90 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -948,7 +948,7 @@ OldSerXidGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(OldSerXidSlruCtl,
- OldSerXidPage(xid), xid);
+ OldSerXidPage(xid), xid, false);
val = OldSerXidValue(slotno, xid);
LWLockRelease(OldSerXidLock);
return val;
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 00dbd803e1..916dd8203e 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -144,7 +144,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, bool lock_held);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
--
2.24.2 (Apple Git-127)
Attachment: v1-0003-Make-MultiXact-local-cache-size-configurable.patch
From 944551228e4c2b4ec015359821122d9676fcf7e0 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 11 May 2020 16:17:02 +0500
Subject: [PATCH v1 3/3] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
5 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 71aa5c4900..6dbc3ce5f1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1777,6 +1777,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of MultiXacts cached by each backend. Any SLRU lookup is preceded
+ by a cache lookup. Higher values help to deduplicate lock sets, while
+ lower values make cache lookups faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 5387a3602d..62f9bf73eb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1589,7 +1589,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index aebfbdb67c..b17382293c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -151,3 +151,4 @@ double vacuum_cleanup_index_scale_factor;
int multixact_offsets_slru_buffers = 8;
int multixact_members_slru_buffers = 16;
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b139e387f3..a74f8a8f20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2292,6 +2292,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of MultiXacts cached by each backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 95f4633a88..27d5170ff1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -163,6 +163,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT int multixact_offsets_slru_buffers;
extern PGDLLIMPORT int multixact_members_slru_buffers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
--
2.24.2 (Apple Git-127)
On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what do we do if there are just zeroes because step 4 was not executed yet? Nothing; we return an empty members list.
What am I missing?
Best regards, Andrey Borodin.
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
On 11 May 2020, at 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
But, don't we hold exclusive lock on the buffer through all the steps
above?
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
So transactions never see such incomplete mxids, I believe.
What am I missing?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 14 May 2020, at 06:25, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
11 мая 2020 г., в 16:17, Andrey M. Borodin <x4mmm@yandex-team.ru> написал(а):
I've went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
But, don't we hold exclusive lock on the buffer through all the steps above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers has committed a tuple delete, but the standby still sees the tuple as alive.
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
So transactions never see such incomplete mxids, I believe.
I've observed the sleep in step 2, and I believe it's possible to observe the effects of step 4 too.
Maybe we could add a lock on the standby to eliminate this 1000us wait? Sometimes it hits standbys hard: if someone locks a whole table on the primary, all seq scans on the standbys follow it with MultiXactOffsetControlLock contention.
It looks like this:
#0 0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1 0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2 0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3 GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4 0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at ./build/../src/backend/access/heap/heapam.c:7039
#5 0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6 0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7 0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8 0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9 heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>, node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:164
Best regards, Andrey Borodin.
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
But, don't we hold exclusive lock on the buffer through all the steps above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers has committed a tuple delete, but the standby still sees the tuple as alive.
Ah, right. I was looking at GetNewMultiXactId. Actually,
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent references to
the mxact id being created, and GetMultiXactIdMembers takes that
case into account.
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
So transactions never see such incomplete mxids, I believe.
I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
Maybe we could add a lock on the standby to avoid this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned a valid offset, but actually that is not obvious.
If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 14 May 2020, at 11:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
I'm looking more at MultiXact and it seems to me that we have a race condition there.
When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock
But, don't we hold an exclusive lock on the buffer through all the steps above?
Yes... Unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees it as alive.
Ah, right. I looked from GetNewMultiXactId. Actually
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
the creating mxact id. And GetMultiXactIdMembers is considering that
case.
When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what do we do if there are just zeroes because step 4 is not executed yet? Nothing, return an empty members list.
So transactions never see such incomplete mxids, I believe.
I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
Maybe we could add a lock on the standby to avoid this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on standbys follow it with MultiXactOffsetControlLock contention.
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned a valid offset, but actually that is not obvious.
If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.
We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
Actually, if we read an empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because a transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix, we can probably just add a comment. If this reasoning is correct.
Best regards, Andrey Borodin.
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned valid offset, but actually that is not obvious.If we add a single giant lock just to isolate ,say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need finer-grained locking-key for standby
that works similary to buffer lock on primary, that doesn't cause
confilicts between irrelevant mxids.We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
Actually, if we read an empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because a transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix, we can probably just add a comment. If this reasoning is correct.
Step 4 of the reader side reads the members of the target mxid. They are
already written if the offset of the *next* mxid is filled in.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
mxid_wait_instead_of_sleep.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ ConditionVariable nextoff_cv;
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetControlLock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. The waiters
+ * are waiting for the offset of the mxid after their target to learn the
+ * number of members of the target mxid, so we don't need to wait until the
+ * members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just before writing the offset.
+ * Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
LWLockRelease(MultiXactOffsetControlLock);
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXact/WaitNextXactMembers";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS
} WaitEventIPC;
/* ----------
On 15 May 2020, at 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned a valid offset, but actually that is not obvious.
If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.
We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.
Actually, if we read an empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because a transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix, we can probably just add a comment. If this reasoning is correct.
Step 4 of the reader side reads the members of the target mxid. They are
already written if the offset of the *next* mxid is filled in.
Most often - yes, but members are not guaranteed to be filled in order. Whoever wins MultiXactMemberControlLock will write first.
But nobody can read the members of an MXID before it is returned. And its members will be written before the MXID is returned.
Best regards, Andrey Borodin.
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
On 15 May 2020, at 05:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote in
GetMultiXactIdMembers believes that 4 is successfully done if 2
returned a valid offset, but actually that is not obvious.
If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily. Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.
We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.
Happy to hear that. It would need to use a timeout just in case, though.
Actually, if we read an empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because a transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix, we can probably just add a comment. If this reasoning is correct.
Step 4 of the reader side reads the members of the target mxid. They are
already written if the offset of the *next* mxid is filled in.
Most often - yes, but members are not guaranteed to be filled in order. Whoever wins MultiXactMemberControlLock will write first.
But nobody can read the members of an MXID before it is returned. And its members will be written before the MXID is returned.
Yeah, right. Otherwise assertion failure happens.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
cheers ./daniel
On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2
Best regards, Andrey Borodin.
Attachments:
v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (application/octet-stream)
From 264b475fbf6ee519dc3b4eb2cca25145d2aaa569 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 9 May 2020 21:19:18 +0500
Subject: [PATCH v2 1/4] Use shared lock in GetMultiXactIdMembers for offsets
and members
Previously, reading a multixact required exclusive control locks on the
offsets and members SLRUs. This could lead to contention.
In this commit we take advantage of SimpleLruReadPage_ReadOnly,
similar to CLOG usage.
---
src/backend/access/transam/clog.c | 2 +-
src/backend/access/transam/commit_ts.c | 2 +-
src/backend/access/transam/multixact.c | 12 ++++++------
src/backend/access/transam/slru.c | 8 +++++---
src/backend/access/transam/subtrans.c | 2 +-
src/backend/commands/async.c | 2 +-
src/backend/storage/lmgr/predicate.c | 2 +-
src/include/access/slru.h | 2 +-
8 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..3614d0c774 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -643,7 +643,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, false);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9cdb136435..5fbbf446e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -342,7 +342,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, false);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 79ec21c75a..241d9dd78d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1321,12 +1321,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1358,7 +1358,7 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, tmpMXact, true);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1382,7 +1382,7 @@ retry:
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(MultiXactMemberSLRULock, LW_SHARED);
truelength = 0;
prev_pageno = -1;
@@ -1399,7 +1399,7 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno, multi, true);
prev_pageno = pageno;
}
@@ -2745,7 +2745,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 61249f4a12..96d0def5c2 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -475,17 +475,19 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
+ * Control lock will be held at exit. If lock_held is true, at least the
+ * shared control lock must be held at entry.
* It is unspecified whether the lock will be shared or exclusive.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid, bool lock_held)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (!lock_held)
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f33ae407a6..8d8e0d9ec7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -123,7 +123,7 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, false);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 71b7577afc..70085fc037 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2012,7 +2012,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, false);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d24919f76b..e2a7d00d37 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -948,7 +948,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, false);
val = SerialValue(slotno, xid);
LWLockRelease(SerialSLRULock);
return val;
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..471c692021 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -140,7 +140,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, bool lock_held);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
--
2.24.2 (Apple Git-127)
v2-0002-Make-MultiXact-local-cache-size-configurable.patch (application/octet-stream)
From 520686d6cc22a599bc06b91391171ca218c3db75 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 11 May 2020 16:17:02 +0500
Subject: [PATCH v2 2/4] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 1 +
5 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d99373a7..31d7c1f4b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1777,6 +1777,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of MultiXacts cached by each backend. Any SLRU lookup is preceded
+ by a cache lookup. Larger cache sizes help to deduplicate lock sets, while
+ smaller values make cache lookup faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 241d9dd78d..445b6258a4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1589,7 +1589,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 01e739da40..f404167ad0 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -151,3 +151,4 @@ double vacuum_cleanup_index_scale_factor;
int multixact_offsets_slru_buffers = 8;
int multixact_members_slru_buffers = 16;
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d1fbb75cd5..6443f595cc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2273,6 +2273,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of MultiXacts cached by each backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bb1475ad2b..9ce2adea63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -163,6 +163,7 @@ extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
extern PGDLLIMPORT int multixact_offsets_slru_buffers;
extern PGDLLIMPORT int multixact_members_slru_buffers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
--
2.24.2 (Apple Git-127)
v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (application/octet-stream)
From ddde253b10b87d060d34f8d7250be27a27c18c0a Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 22 May 2020 14:13:33 +0500
Subject: [PATCH v2 3/4] Add conditional variable to wait for next MultXact
offset in edge case
---
src/backend/access/transam/multixact.c | 23 ++++++++++++++++++++++-
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 445b6258a4..02de9a7c13 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ ConditionVariable nextoff_cv;
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. The waiters
+ * are waiting for the offset of the mxid after their target to learn the
+ * number of members of the target mxid, so we don't need to wait until the
+ * members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just before writing the offset.
+ * Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
LWLockRelease(MultiXactOffsetSLRULock);
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index edfa774ee4..4bce534c7a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3877,6 +3877,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
+ break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXact/WaitNextXactMembers";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..26f828597b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,6 +887,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.24.2 (Apple Git-127)
On 8 Jul 2020, at 09:03, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2
This version too has stopped applying according to the CFbot. I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.
cheers ./daniel
On 08.07.2020 10:03, Andrey M. Borodin wrote:
On 2 July 2020, at 17:02, Daniel Gustafsson <daniel@yesql.se> wrote:
On 15 May 2020, at 02:03, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Generally in such cases, condition variables would work. In the
attached PoC, the reader side gets no penalty in the "likely" code
path. The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases. But I couldn't cause the
situation where the 1000us sleep is reached.
The submitted patch no longer applies, can you please submit an updated
version? I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2
Best regards, Andrey Borodin.
1) The first patch is sensible and harmless, so I think it is ready for
committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers
configurable, however, I doubt if it really solves the problem, or just
moves it to another place?
The previous patch you sent was based on some version that contained
changes for other slru buffers numbers: 'multixact_offsets_slru_buffers'
and 'multixact_members_slru_buffers'. Have you just forgotten to attach
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches.
Meanwhile, I attach the rebased patch to calm down the CFbot. The
changes are trivial.
2.1) I think that both min and max values for this parameter are too
extreme. Have you tested them?
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v3-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (text/x-patch; charset=UTF-8)
From 264b475fbf6ee519dc3b4eb2cca25145d2aaa569 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 9 May 2020 21:19:18 +0500
Subject: [PATCH v2 1/4] Use shared lock in GetMultiXactIdMembers for offsets
and members
Previously, reading a multixact required exclusive control locks on the
offsets and members SLRUs. This could lead to contention.
In this commit we take advantage of SimpleLruReadPage_ReadOnly,
similar to CLOG usage.
---
src/backend/access/transam/clog.c | 2 +-
src/backend/access/transam/commit_ts.c | 2 +-
src/backend/access/transam/multixact.c | 12 ++++++------
src/backend/access/transam/slru.c | 8 +++++---
src/backend/access/transam/subtrans.c | 2 +-
src/backend/commands/async.c | 2 +-
src/backend/storage/lmgr/predicate.c | 2 +-
src/include/access/slru.h | 2 +-
8 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..3614d0c774 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -643,7 +643,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, false);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9cdb136435..5fbbf446e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -342,7 +342,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, false);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 79ec21c75a..241d9dd78d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1321,12 +1321,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1358,7 +1358,7 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, tmpMXact, true);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1382,7 +1382,7 @@ retry:
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(MultiXactMemberSLRULock, LW_SHARED);
truelength = 0;
prev_pageno = -1;
@@ -1399,7 +1399,7 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno, multi, true);
prev_pageno = pageno;
}
@@ -2745,7 +2745,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 61249f4a12..96d0def5c2 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -475,17 +475,19 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
+ * Control lock will be held at exit. If lock_held is true at least
+ * shared control lock must be held.
* It is unspecified whether the lock will be shared or exclusive.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid, bool lock_held)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (!lock_held)
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f33ae407a6..8d8e0d9ec7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -123,7 +123,7 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, false);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 71b7577afc..70085fc037 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2012,7 +2012,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, false);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d24919f76b..e2a7d00d37 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -948,7 +948,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, false);
val = SerialValue(slotno, xid);
LWLockRelease(SerialSLRULock);
return val;
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..471c692021 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -140,7 +140,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, bool lock_held);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
--
2.24.2 (Apple Git-127)
v3-0002-Make-MultiXact-local-cache-size-configurable.patch (text/x-patch)
commit 67053e7204b5044a7147d878eec4c66217495298
Author: anastasia <a.lubennikova@postgrespro.ru>
Date: Fri Aug 28 19:15:41 2020 +0300
Make MultiXact local cache size configurable
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c550..14eea77f2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1821,6 +1821,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of MultiXacts cached by each backend. Any SLRU lookup
+ is preceded by a cache lookup. Larger caches help to deduplicate lock sets,
+ while lower values make cache lookups faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..99a2bc7c80 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1589,7 +1589,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..9ca71933dc 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,3 +149,5 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..6f3027fd66 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of MultiXacts cached by each backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..01af61c963 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,6 +162,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
+
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
extern PGDLLIMPORT TimestampTz MyStartTimestamp;
v3-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (text/x-patch)
From ddde253b10b87d060d34f8d7250be27a27c18c0a Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 22 May 2020 14:13:33 +0500
Subject: [PATCH v2 3/4] Add conditional variable to wait for next MultXact
offset in edge case
---
src/backend/access/transam/multixact.c | 23 ++++++++++++++++++++++-
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 445b6258a4..02de9a7c13 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ ConditionVariable nextoff_cv;
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. The waiters
+ * are waiting for the offset of the mxid next of the target to know the
+ * number of members of the target mxid, so we don't need to wait for
+ * members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just before writing the offset.
+ * Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
LWLockRelease(MultiXactOffsetSLRULock);
CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index edfa774ee4..4bce534c7a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3877,6 +3877,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
+ break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXact/WaitNextXactMembers";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..26f828597b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -887,6 +887,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.24.2 (Apple Git-127)
Hi, Anastasia!
28 авг. 2020 г., в 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> написал(а):
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had a 4th patch with tests, but those tests didn't work well: I still have not managed to stress the SLRUs hard enough to reproduce the problem from production...
You are absolutely correct in point 2: I only tested with sane values, and I observed extreme performance degradation with values around 64 megabytes. Which maximum should we pick? 1GB? Or the highest value that still functions?
I greatly appreciate your review; sorry for the long delay. Thanks!
Best regards, Andrey Borodin.
On 28.09.2020 17:41, Andrey M. Borodin wrote:
You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
I would go with values that we consider adequate for this setting. As I
see it, there is no strict rule about this in guc.c, and many variables
have extreme boundary values. Anyway, we need to explain the choice at
least in the documentation and code comments.
It seems that the default is conservative enough, so it could also serve
as the minimal value. As for the maximum, can you provide any benchmark
results? If there is a peak followed by a noticeable performance
degradation, we can use it to calculate a preferable maximum value.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I took a look at this patchset.
The 1st and 3rd patches look good to me. I made just minor improvements.
1) There is still a case where SimpleLruReadPage_ReadOnly() reacquires
the SLRU lock that is already held in exclusive mode. I've avoided this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed. Also, no current wait
events use slashes, and I don't think we should introduce slashes
here. Even if we should, then we should also rename existing wait
events to be consistent with a new one. So, I've renamed the new wait
event to remove the slash.
Regarding patch 2: the current documentation in the patch doesn't
explain to the user how to set the new parameter. I think we should
give users an idea of which workloads need high values of the
multixact_local_cache_entries parameter and which don't. We should also
explain the downsides of high multixact_local_cache_entries values.
Ideally, we would get the benefit without burdening users with a new
nontrivial parameter, but I don't have a concrete idea here.
I'd like to propose committing 1 and 3, but leave 2 for further review.
------
Regards,
Alexander Korotkov
Attachments:
v4-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offsets.patch (application/octet-stream)
From 12bbf5dbd372a431d106e7e8c3ed50ffa5707a19 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 01:50:08 +0300
Subject: [PATCH 1/3] Use shared lock in GetMultiXactIdMembers for offsets and
members
Previously, reading a multixact required exclusive control locks for both
the offsets and members SLRUs. This could lead to excessive lock contention.
This commit makes the multixact SLRUs take advantage of SimpleLruReadPage_ReadOnly,
similar to clog, commit_ts and subtrans.
In order to avoid extra reacquisition of the SLRU lock, we teach
SimpleLruReadPage_ReadOnly() to take into account the current lock mode and
report the resulting lock mode back.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/clog.c | 4 +++-
src/backend/access/transam/commit_ts.c | 4 +++-
src/backend/access/transam/multixact.c | 22 +++++++++++++++-------
src/backend/access/transam/slru.c | 23 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +++-
src/backend/commands/async.c | 4 +++-
src/backend/storage/lmgr/predicate.c | 4 +++-
src/include/access/slru.h | 2 +-
src/include/storage/lwlock.h | 2 ++
9 files changed, 50 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b9..4c372065dee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
int lsnindex;
char *byteptr;
XidStatus status;
+ LWLockMode lockmode = LW_NONE;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
+ Assert(lockmode != LW_NONE);
LWLockRelease(XactSLRULock);
return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index cb8a9688018..c19b0b5399e 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
CommitTimestampEntry entry;
TransactionId oldestCommitTsXid;
TransactionId newestCommitTsXid;
+ LWLockMode lockmode = LW_NONE;
if (!TransactionIdIsValid(xid))
ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
+ Assert(lockmode != LW_NONE);
LWLockRelease(CommitTsSLRULock);
return *ts != 0;
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 43653fe5721..ccbce90f0ea 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLockMode lockmode = LW_NONE;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ tmpMXact, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1395,14 +1398,14 @@ retry:
length = nextMXOffset - offset;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
+ lockmode = LW_NONE;
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1418,7 +1421,8 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno,
+ multi, &lockmode);
prev_pageno = pageno;
}
@@ -1441,6 +1445,7 @@ retry:
truelength++;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactMemberSLRULock);
/*
@@ -2733,6 +2738,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
int entryno;
int slotno;
MultiXactOffset *offptr;
+ LWLockMode lockmode = LW_NONE;
Assert(MultiXactState->finishedStartup);
@@ -2749,10 +2755,12 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
*result = offset;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 16a78986971..ca0a23ab875 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -487,17 +487,23 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
- * It is unspecified whether the lock will be shared or exclusive.
+ * Control lock must be held in *lock_mode mode, which may be LW_NONE. Control
+ * lock will be held at exit in at least shared mode. Resulting control lock
+ * mode is set to *lock_mode.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid,
+ LWLockMode *lock_mode)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (*lock_mode == LW_NONE)
+ {
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
+ *lock_mode = LW_SHARED;
+ }
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -517,8 +523,13 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ if (*lock_mode != LW_EXCLUSIVE)
+ {
+ Assert(*lock_mode == LW_SHARED);
+ LWLockRelease(shared->ControlLock);
+ LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ *lock_mode = LW_EXCLUSIVE;
+ }
return SimpleLruReadPage(ctl, pageno, true, xid);
}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0111e867c79..7bb15431893 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -113,6 +113,7 @@ SubTransGetParent(TransactionId xid)
int slotno;
TransactionId *ptr;
TransactionId parent;
+ LWLockMode lockmode = LW_NONE;
/* Can't ask about stuff that might not be around anymore */
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
@@ -123,12 +124,13 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, &lockmode);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
parent = *ptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(SubtransSLRULock);
return parent;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f93..8b419575590 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2002,6 +2002,7 @@ asyncQueueReadAllNotifications(void)
int curoffset = QUEUE_POS_OFFSET(pos);
int slotno;
int copysize;
+ LWLockMode lockmode = LW_NONE;
/*
* We copy the data from SLRU into a local buffer, so as to avoid
@@ -2010,7 +2011,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, &lockmode);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
@@ -2027,6 +2028,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ Assert(lockmode != LW_NONE);
LWLockRelease(NotifySLRULock);
/*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c6..a4df90a8aeb 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -924,6 +924,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
TransactionId tailXid;
SerCommitSeqNo val;
int slotno;
+ LWLockMode lockmode = LW_NONE;
Assert(TransactionIdIsValid(xid));
@@ -946,8 +947,9 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, &lockmode);
val = SerialValue(slotno, xid);
+ Assert(lockmode != LW_NONE);
LWLockRelease(SerialSLRULock);
return val;
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b39b43504d8..4b66d3b592b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -142,7 +142,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, LWLockMode *lock_mode);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d2..2684c8e51c9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
typedef enum LWLockMode
{
+ LW_NONE, /* Not a lock mode. Indicates that there is
+ * no lock. */
LW_EXCLUSIVE,
LW_SHARED,
LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode,
--
2.14.3
Attachment: v4-0002-Make-MultiXact-local-cache-size-configurable.patch (application/octet-stream)
From b1af887f4c4e5402dce1b6ea226ac8b81b92443f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 03:44:50 +0300
Subject: [PATCH 2/3] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 2 ++
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 2 ++
5 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e318..fd4ca293476 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1823,6 +1823,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of MultiXacts cached by each backend. Every SLRU lookup
+ is preceded by a cache lookup. A larger cache helps to deduplicate lock sets,
+ while smaller values make cache lookups faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ccbce90f0ea..57be24c0cc1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1613,7 +1613,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab82168398..9ca71933dcc 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,3 +149,5 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa47..d667578cc33 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of MultiXacts cached by each backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..01af61c963a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,6 +162,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
+
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
extern PGDLLIMPORT TimestampTz MyStartTimestamp;
--
2.14.3
Attachment: v4-0003-Add-conditional-variable-to-wait-for-next-MultXact-o.patch (application/octet-stream)
From b878ffe74a10d34d312a7be66df7a53b423aaaec Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 01:55:31 +0300
Subject: [PATCH 3/3] Add condition variable to wait for next MultiXact offset
in corner case
GetMultiXactIdMembers() has a corner case when the next multixact offset is
not yet set. In that case GetMultiXactIdMembers() has to sleep until the offset
is set. Currently the sleeping is implemented naively, using pg_usleep() and
retry. This commit implements the sleep with a condition variable, which
provides a more efficient way to wait for the event.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/multixact.c | 29 +++++++++++++++++++++++++++--
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57be24c0cc1..09df91899ec 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ ConditionVariable nextoff_cv;
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -892,6 +894,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. Waiters
+ * need the offset of the mxid following their target in order to
+ * compute the target's member count, so they do not need to wait until
+ * the members of this mxid are recorded as well.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1389,9 +1399,23 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The backend recording the next mxid is just about to write the offset.
+ * Wait for that offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
+ /*
+ * We don't have to recheck if multixact was filled in during
+ * ConditionVariablePrepareToSleep(), because we were holding
+ * MultiXactOffsetSLRULock.
+ */
LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1873,6 +1897,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc628..b99398a97e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4020,6 +4020,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXactWaitNextMembers";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f158..75ede141ab7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -952,6 +952,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.14.3
On 26 Oct 2020, at 06:05, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 28 Aug 2020, at 23:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
2) I like the initial proposal to make various SLRU buffers configurable; however, I doubt whether it really solves the problem or just moves it elsewhere.
The previous patch you sent was based on some version that contained changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgotten to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
3) No changes for third patch. I just renamed it for consistency.
Thank you for your review.
Indeed, I had a 4th patch with tests, but those tests didn't work well: I still have not managed to stress the SLRUs enough to reproduce the problem from production...
You are absolutely correct in point 2: I only did tests with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know what the highest value should be. 1GB? Or the highest value at which it still functions?
I greatly appreciate your review, sorry for so long delay. Thanks!
I took a look at this patchset.
The 1st and 3rd patches look good to me. I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode. I've avoided this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
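The idea of passing the lock mode in and out can be sketched as a standalone model. This is illustrative C only, not the actual PostgreSQL code: `read_page_readonly`, `acquire_count`, and the stub lock functions are made-up names standing in for SimpleLruReadPage_ReadOnly() and the LWLock machinery.

```c
typedef enum { LW_NONE, LW_SHARED, LW_EXCLUSIVE } LWLockMode;

static int acquire_count;               /* counts acquisitions, for illustration */

static void lock_acquire(LWLockMode mode) { (void) mode; acquire_count++; }
static void lock_release(void)            { }

/* Mimics the revised contract: on entry *lock_mode tells which mode the
 * caller already holds (LW_NONE for none); on exit the lock is held in at
 * least shared mode and *lock_mode reports the mode actually held.
 * Returns nonzero on the fast path (page found under the shared lock). */
static int read_page_readonly(int page_cached, LWLockMode *lock_mode)
{
    if (*lock_mode == LW_NONE)
    {
        lock_acquire(LW_SHARED);
        *lock_mode = LW_SHARED;
    }

    if (page_cached)
        return 1;                       /* shared lock was enough */

    /* Miss: upgrade to exclusive for the physical read, unless the
     * caller already holds the lock exclusively. */
    if (*lock_mode != LW_EXCLUSIVE)
    {
        lock_release();
        lock_acquire(LW_EXCLUSIVE);
        *lock_mode = LW_EXCLUSIVE;
    }
    return 0;                           /* caller would now read from disk */
}
```

A caller that loops over several pages (as GetMultiXactIdMembers() does) simply keeps passing the same `lock_mode` variable, so the lock acquired for the first page is reused for subsequent ones instead of being released and retaken.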
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed. Also, no current wait
events use slashes, and I don't think we should introduce slashes
here. Even if we should, then we should also rename the existing wait
events to be consistent with the new one. So, I've renamed the new wait
event to remove the slash.
Regarding patch 2: I see the current documentation in the patch
doesn't explain to the user how to set the new parameter. I think we
should give users an idea of which workloads need high values of the
multixact_local_cache_entries parameter and which don't. Also, we
should explain the negative aspects of high values of
multixact_local_cache_entries. Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.
I'd like to propose committing 1 and 3, but leaving 2 for further review.
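For reference, the waiting scheme that patch 0003 switches to can be modeled with POSIX threads. This is a hypothetical standalone sketch, not PostgreSQL code: the real implementation uses ConditionVariableSleep(), which handles interrupts internally (hence the removed CHECK_FOR_INTERRUPTS()), and the value 42 is an arbitrary stand-in offset.

```c
#include <pthread.h>

static pthread_mutex_t offset_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nextoff_cv = PTHREAD_COND_INITIALIZER;
static unsigned long next_offset = 0;   /* 0 means "not yet recorded" */

/* Writer side: the analogue of RecordNewMultiXact() broadcasting on the
 * condition variable once the offset has been written. */
static void *record_offset(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&offset_lock);
    next_offset = 42;                   /* hypothetical offset value */
    pthread_cond_broadcast(&nextoff_cv);
    pthread_mutex_unlock(&offset_lock);
    return NULL;
}

/* Reader side: the analogue of the "corner case 2" retry loop in
 * GetMultiXactIdMembers(), sleeping on the CV instead of polling with
 * pg_usleep(1000L). */
static unsigned long wait_for_next_offset(void)
{
    unsigned long result;

    pthread_mutex_lock(&offset_lock);
    while (next_offset == 0)
        pthread_cond_wait(&nextoff_cv, &offset_lock);
    result = next_offset;
    pthread_mutex_unlock(&offset_lock);
    return result;
}
```

The point of the change is that the reader is woken immediately when the writer finishes, instead of sleeping for fixed 1ms intervals and rechecking.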
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.
I've tried to find an optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets/members buffers. I concur that patch 2 of the patchset does not seem documented enough.
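The local cache policy that this parameter would tune — push new entries at the head, evict from the tail once the limit is exceeded, as in mXactCachePut() — can be modeled in a few lines. This is standalone illustrative C, not the real multixact.c code; `cache_limit` stands in for the proposed GUC.

```c
#include <stdlib.h>

typedef struct CacheEnt { int multi; struct CacheEnt *next; } CacheEnt;

static CacheEnt *cache_head;
static int cache_members;
static int cache_limit = 4;        /* stands in for multixact_local_cache_entries */

static void cache_put(int multi)
{
    CacheEnt *e = malloc(sizeof(CacheEnt));

    e->multi = multi;
    e->next = cache_head;
    cache_head = e;                /* newest entry goes to the front */

    if (cache_members++ >= cache_limit)
    {
        /* Over the limit: walk to the second-to-last node and unlink
         * the tail, i.e. the least-recently-added entry. */
        CacheEnt *p = cache_head;

        while (p->next->next != NULL)
            p = p->next;
        free(p->next);
        p->next = NULL;
        cache_members--;
    }
}

static int cache_size(void)
{
    int n = 0;

    for (CacheEnt *p = cache_head; p != NULL; p = p->next)
        n++;
    return n;
}
```

The trade-off the documentation would need to spell out is visible here: a larger limit means more entries survive (fewer SLRU lookups and more lock-set deduplication), but every miss pays for a longer linear scan.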
Best regards, Andrey Borodin.
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.
I've tried to find an optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets/members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Thank you. I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.
------
Regards,
Alexander Korotkov
Attachments:
v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (application/octet-stream)
From e81bccfe1dbf0403163b8e1e836211e0dab1d7a4 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 27 Oct 2020 19:29:56 +0300
Subject: [PATCH 1/3] Use shared lock in GetMultiXactIdMembers for offsets and
members
Previously, reading a multixact required taking the control locks of both the
offsets and members SLRUs in exclusive mode. This could lead to excessive lock
contention.
This commit makes the multixact SLRUs take advantage of SimpleLruReadPage_ReadOnly()
similar to clog, commit_ts and subtrans.
In order to avoid an extra reacquisition of the SLRU lock, we teach
SimpleLruReadPage_ReadOnly() to take the current lock mode into account and
report the resulting lock mode back.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/clog.c | 4 +++-
src/backend/access/transam/commit_ts.c | 4 +++-
src/backend/access/transam/multixact.c | 22 +++++++++++++++-------
src/backend/access/transam/slru.c | 23 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +++-
src/backend/commands/async.c | 4 +++-
src/backend/storage/lmgr/lwlock.c | 3 +++
src/backend/storage/lmgr/predicate.c | 4 +++-
src/include/access/slru.h | 2 +-
src/include/storage/lwlock.h | 2 ++
10 files changed, 53 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b9..4c372065dee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
int lsnindex;
char *byteptr;
XidStatus status;
+ LWLockMode lockmode = LW_NONE;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
+ Assert(lockmode != LW_NONE);
LWLockRelease(XactSLRULock);
return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index cb8a9688018..c19b0b5399e 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
CommitTimestampEntry entry;
TransactionId oldestCommitTsXid;
TransactionId newestCommitTsXid;
+ LWLockMode lockmode = LW_NONE;
if (!TransactionIdIsValid(xid))
ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
+ Assert(lockmode != LW_NONE);
LWLockRelease(CommitTsSLRULock);
return *ts != 0;
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 43653fe5721..ccbce90f0ea 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLockMode lockmode = LW_NONE;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ tmpMXact, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1395,14 +1398,14 @@ retry:
length = nextMXOffset - offset;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
+ lockmode = LW_NONE;
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1418,7 +1421,8 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno,
+ multi, &lockmode);
prev_pageno = pageno;
}
@@ -1441,6 +1445,7 @@ retry:
truelength++;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactMemberSLRULock);
/*
@@ -2733,6 +2738,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
int entryno;
int slotno;
MultiXactOffset *offptr;
+ LWLockMode lockmode = LW_NONE;
Assert(MultiXactState->finishedStartup);
@@ -2749,10 +2755,12 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
*result = offset;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 16a78986971..ca0a23ab875 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -487,17 +487,23 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
- * It is unspecified whether the lock will be shared or exclusive.
+ * On entry, the control lock must be held in *lock_mode mode, or not held at
+ * all if *lock_mode is LW_NONE. On exit, the lock is held in at least shared
+ * mode, and *lock_mode is updated to the mode actually held.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid,
+ LWLockMode *lock_mode)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (*lock_mode == LW_NONE)
+ {
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
+ *lock_mode = LW_SHARED;
+ }
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -517,8 +523,13 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ if (*lock_mode != LW_EXCLUSIVE)
+ {
+ Assert(*lock_mode == LW_NONE);
+ LWLockRelease(shared->ControlLock);
+ LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ *lock_mode = LW_EXCLUSIVE;
+ }
return SimpleLruReadPage(ctl, pageno, true, xid);
}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0111e867c79..7bb15431893 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -113,6 +113,7 @@ SubTransGetParent(TransactionId xid)
int slotno;
TransactionId *ptr;
TransactionId parent;
+ LWLockMode lockmode = LW_NONE;
/* Can't ask about stuff that might not be around anymore */
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
@@ -123,12 +124,13 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, &lockmode);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
parent = *ptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(SubtransSLRULock);
return parent;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f93..8b419575590 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2002,6 +2002,7 @@ asyncQueueReadAllNotifications(void)
int curoffset = QUEUE_POS_OFFSET(pos);
int slotno;
int copysize;
+ LWLockMode lockmode = LW_NONE;
/*
* We copy the data from SLRU into a local buffer, so as to avoid
@@ -2010,7 +2011,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, &lockmode);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
@@ -2027,6 +2028,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ Assert(lockmode != LW_NONE);
LWLockRelease(NotifySLRULock);
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc0954..8a7726f11c0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1068,6 +1068,8 @@ LWLockWakeup(LWLock *lock)
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
+ Assert(mode != LW_NONE);
+
/*
* If we don't have a PGPROC structure, there's no way to wait. This
* should never occur, since MyProc should only be null during shared
@@ -1828,6 +1830,7 @@ LWLockRelease(LWLock *lock)
elog(ERROR, "lock %s is not held", T_NAME(lock));
mode = held_lwlocks[i].mode;
+ Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c6..a4df90a8aeb 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -924,6 +924,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
TransactionId tailXid;
SerCommitSeqNo val;
int slotno;
+ LWLockMode lockmode = LW_NONE;
Assert(TransactionIdIsValid(xid));
@@ -946,8 +947,9 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, &lockmode);
val = SerialValue(slotno, xid);
+ Assert(lockmode != LW_NONE);
LWLockRelease(SerialSLRULock);
return val;
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b39b43504d8..4b66d3b592b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -142,7 +142,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, LWLockMode *lock_mode);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d2..e680c6397aa 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
typedef enum LWLockMode
{
+ LW_NONE, /* Not a lock mode. Indicates that there is no
+ * lock. */
LW_EXCLUSIVE,
LW_SHARED,
LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode,
--
2.14.3
v5-0002-Make-MultiXact-local-cache-size-configurable.patch (application/octet-stream)
From 53ef71094688048002ffd1a62dbbb1a4a90d4691 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 03:44:50 +0300
Subject: [PATCH 2/3] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 2 ++
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 2 ++
5 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e318..fd4ca293476 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1823,6 +1823,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of MultiXacts cached by each backend. Every SLRU lookup
+ is preceded by a cache lookup. A larger cache helps to deduplicate lock sets,
+ while smaller values make cache lookups faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ccbce90f0ea..57be24c0cc1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1613,7 +1613,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab82168398..9ca71933dcc 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,3 +149,5 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa47..d667578cc33 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of MultiXacts cached by each backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..01af61c963a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,6 +162,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
+
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
extern PGDLLIMPORT TimestampTz MyStartTimestamp;
--
2.14.3
v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch (application/octet-stream)
From 8876966a703f35abf98b575978c9d38ff4297473 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 01:55:31 +0300
Subject: [PATCH 3/3] Add condition variable to wait for next MultiXact offset
in corner case
GetMultiXactIdMembers() has a corner case when the next multixact offset is
not yet set. In that case GetMultiXactIdMembers() has to sleep until the offset
is set. Currently the sleeping is implemented naively, using pg_usleep() and
retry. This commit implements the sleep with a condition variable, which
provides a more efficient way to wait for the event.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/multixact.c | 35 ++++++++++++++++++++++++++++++++--
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57be24c0cc1..70d977cba33 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,13 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ /*
+ * Condition variable for waiting until the next multixact has been
+ * completely filled in. See GetMultiXactIdMembers() and
+ * RecordNewMultiXact() for details.
+ */
+ ConditionVariable nextoffCV;
+
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -892,6 +900,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. Waiters
+ * need the offset of the mxid following their target in order to
+ * compute the target's member count, so they do not need to wait until
+ * the members of this mxid are recorded as well.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoffCV);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1389,9 +1405,23 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The backend recording the next mxid is just about to write the
+ * offset. Wait for that offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoffCV);
+
+ /*
+ * We don't have to recheck if multixact was filled in during
+ * ConditionVariablePrepareToSleep(), because we were holding
+ * MultiXactOffsetSLRULock.
+ */
LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoffCV,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1873,6 +1903,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoffCV);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc628..b99398a97e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4020,6 +4020,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
+ break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXactWaitNextMembers";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f158..75ede141ab7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -952,6 +952,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.14.3
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.
I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets/members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Thank you. I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.
I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.
------
Regards,
Alexander Korotkov
Attachments:
Attachment: v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (application/octet-stream)
From 9307192bbb9d1ee48aba5c335ae54cff7b34ffb7 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 27 Oct 2020 19:29:56 +0300
Subject: [PATCH 1/3] Use shared lock in GetMultiXactIdMembers for offsets and
members
Previously, reading a multixact required the exclusive control lock for both
the offsets and members SLRUs. This could lead to excessive lock contention.
This commit makes the multixact SLRUs take advantage of SimpleLruReadPage_ReadOnly()
similarly to clog, commit_ts and subtrans.
In order to avoid extra reacquiring of the SLRU control lock, we teach
SimpleLruReadPage_ReadOnly() to take into account the current lock mode and
report the resulting lock mode back.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/clog.c | 4 +++-
src/backend/access/transam/commit_ts.c | 4 +++-
src/backend/access/transam/multixact.c | 22 +++++++++++++++-------
src/backend/access/transam/slru.c | 23 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +++-
src/backend/commands/async.c | 4 +++-
src/backend/storage/lmgr/lwlock.c | 3 +++
src/backend/storage/lmgr/predicate.c | 4 +++-
src/include/access/slru.h | 2 +-
src/include/storage/lwlock.h | 2 ++
10 files changed, 53 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b9..4c372065dee 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
int lsnindex;
char *byteptr;
XidStatus status;
+ LWLockMode lockmode = LW_NONE;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
+ Assert(lockmode != LW_NONE);
LWLockRelease(XactSLRULock);
return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index cb8a9688018..c19b0b5399e 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
CommitTimestampEntry entry;
TransactionId oldestCommitTsXid;
TransactionId newestCommitTsXid;
+ LWLockMode lockmode = LW_NONE;
if (!TransactionIdIsValid(xid))
ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
+ Assert(lockmode != LW_NONE);
LWLockRelease(CommitTsSLRULock);
return *ts != 0;
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 43653fe5721..ccbce90f0ea 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLockMode lockmode = LW_NONE;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ tmpMXact, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1395,14 +1398,14 @@ retry:
length = nextMXOffset - offset;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
+ lockmode = LW_NONE;
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1418,7 +1421,8 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno,
+ multi, &lockmode);
prev_pageno = pageno;
}
@@ -1441,6 +1445,7 @@ retry:
truelength++;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactMemberSLRULock);
/*
@@ -2733,6 +2738,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
int entryno;
int slotno;
MultiXactOffset *offptr;
+ LWLockMode lockmode = LW_NONE;
Assert(MultiXactState->finishedStartup);
@@ -2749,10 +2755,12 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
*result = offset;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 16a78986971..cf89006e131 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -487,17 +487,23 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
- * It is unspecified whether the lock will be shared or exclusive.
+ * Control lock must be held in *lock_mode mode, which may be LW_NONE. Control
+ * lock will be held at exit in at least shared mode. Resulting control lock
+ * mode is set to *lock_mode.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid,
+ LWLockMode *lock_mode)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (*lock_mode == LW_NONE)
+ {
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
+ *lock_mode = LW_SHARED;
+ }
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -517,8 +523,13 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ if (*lock_mode != LW_EXCLUSIVE)
+ {
+ Assert(*lock_mode != LW_NONE);
+ LWLockRelease(shared->ControlLock);
+ LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ *lock_mode = LW_EXCLUSIVE;
+ }
return SimpleLruReadPage(ctl, pageno, true, xid);
}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0111e867c79..7bb15431893 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -113,6 +113,7 @@ SubTransGetParent(TransactionId xid)
int slotno;
TransactionId *ptr;
TransactionId parent;
+ LWLockMode lockmode = LW_NONE;
/* Can't ask about stuff that might not be around anymore */
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
@@ -123,12 +124,13 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, &lockmode);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
parent = *ptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(SubtransSLRULock);
return parent;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f93..8b419575590 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2002,6 +2002,7 @@ asyncQueueReadAllNotifications(void)
int curoffset = QUEUE_POS_OFFSET(pos);
int slotno;
int copysize;
+ LWLockMode lockmode = LW_NONE;
/*
* We copy the data from SLRU into a local buffer, so as to avoid
@@ -2010,7 +2011,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, &lockmode);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
@@ -2027,6 +2028,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ Assert(lockmode != LW_NONE);
LWLockRelease(NotifySLRULock);
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc0954..8a7726f11c0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1068,6 +1068,8 @@ LWLockWakeup(LWLock *lock)
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
+ Assert(mode != LW_NONE);
+
/*
* If we don't have a PGPROC structure, there's no way to wait. This
* should never occur, since MyProc should only be null during shared
@@ -1828,6 +1830,7 @@ LWLockRelease(LWLock *lock)
elog(ERROR, "lock %s is not held", T_NAME(lock));
mode = held_lwlocks[i].mode;
+ Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c6..a4df90a8aeb 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -924,6 +924,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
TransactionId tailXid;
SerCommitSeqNo val;
int slotno;
+ LWLockMode lockmode = LW_NONE;
Assert(TransactionIdIsValid(xid));
@@ -946,8 +947,9 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, &lockmode);
val = SerialValue(slotno, xid);
+ Assert(lockmode != LW_NONE);
LWLockRelease(SerialSLRULock);
return val;
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b39b43504d8..4b66d3b592b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -142,7 +142,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, LWLockMode *lock_mode);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d2..e680c6397aa 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
typedef enum LWLockMode
{
+ LW_NONE, /* Not a lock mode. Indicates that there is no
+ * lock. */
LW_EXCLUSIVE,
LW_SHARED,
LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode,
--
2.14.3
Attachment: v6-0002-Make-MultiXact-local-cache-size-configurable.patch (application/octet-stream)
From 2c40a5878da97b18e7e64a775f5caa0641e2f39c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 03:44:50 +0300
Subject: [PATCH 2/3] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 2 ++
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 2 ++
5 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e318..fd4ca293476 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1823,6 +1823,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number cached MultiXact by backend. Any SLRU lookup is preceeded
+ by cache lookup. Higher numbers of cache size help to deduplicate lock sets, while
+ lower values make cache lookup faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ccbce90f0ea..57be24c0cc1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1613,7 +1613,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab82168398..9ca71933dcc 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,3 +149,5 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa47..d667578cc33 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of cached MultiXact by backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e33523984..01af61c963a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,6 +162,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
+
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
extern PGDLLIMPORT TimestampTz MyStartTimestamp;
--
2.14.3
Attachment: v6-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch (application/octet-stream)
From da9cc973a4e7d568466b1c4de15479f707ad04f6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 01:55:31 +0300
Subject: [PATCH 3/3] Add conditional variable to wait for next MultXact offset
in corner case
GetMultiXactIdMembers() has a corner case when the next multixact offset is
not yet set. In this case GetMultiXactIdMembers() has to sleep till this offset
is set. Currently the sleeping is implemented in a naive way using pg_usleep()
and retry. This commit implements sleeping with a condition variable, which
provides a more efficient way of waiting for the event.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/multixact.c | 35 ++++++++++++++++++++++++++++++++--
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57be24c0cc1..70d977cba33 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,13 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ /*
+ * Conditional variable for waiting till the filling of the next multixact
+ * will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact()
+ * for details.
+ */
+ ConditionVariable nextoffCV;
+
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -892,6 +900,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. The
+ * waiters are waiting for the offset of the mxid next to the target one
+ * to know the number of members of the target mxid, so we don't need to
+ * wait for the members of this mxid to be recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoffCV);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1389,9 +1405,23 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just before writing the
+ * offset. Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoffCV);
+
+ /*
+ * We don't have to recheck if multixact was filled in during
+ * ConditionVariablePrepareToSleep(), because we were holding
+ * MultiXactOffsetSLRULock.
+ */
LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoffCV,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1873,6 +1903,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoffCV);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc628..b99398a97e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4020,6 +4020,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
+ break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXactWaitNextMembers";
break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f158..75ede141ab7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -952,6 +952,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.14.3
On Tue, Oct 27, 2020 at 08:23:26PM +0300, Alexander Korotkov wrote:
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.
I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets/members buffers. I concur that patch 2 of the patchset does not seem documented enough.
Thank you. I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.
I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.
I did a quick review on this patch series. A couple comments:
0001
----
This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.
IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or an omission.)
In any case, it'd make the lwlock.c changes unnecessary, I think.
0002
----
Specifies the number cached MultiXact by backend. Any SLRU lookup ...
should be 'number of cached ...'
0003
----
* Conditional variable for waiting till the filling of the next multixact
* will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact()
* for details.
Perhaps 'till the next multixact is filled' or 'gets full' would be
better. Not sure.
This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?
If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas, thanks for looking into this!
On 28 Oct 2020, at 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?
If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?
All patches in this thread aim at the same goal: improving performance in the presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem; however, I did some MultiXact stressing here [0]. It's a clumsy test program, because it is still not clear to me which workload parameters trigger MultiXact lock contention. In the generic case I was encountering other locks like *GenLock: XidGenLock, MultixactGenLock etc. Yet our production system has encountered this problem approximately once a month throughout this year.
The test program locks for share different sets of tuples in the presence of concurrent full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0, we lock it.
I've been measuring the time to lock all rows, 3 times for each of the 14 bits, observing the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not contain enough MultiXact locks I tuned the parameters further (number of concurrent clients, number of bits, select queries etc).
Why is it so complicated? It seems that other reproductions of the problem encountered other locks.
Let's describe the patches in this thread from the POV of these tests.
*** Configurable SLRU buffers for MultiXact members and offsets.
From the tests it is clear that high and low values for these buffers affect the test time. Here are the times for one test run with different offsets and members sizes [1].
Our production currently runs with (numbers are pages of buffers)
+#define NUM_MXACTOFFSET_BUFFERS 32
+#define NUM_MXACTMEMBER_BUFFERS 64
And, looking back at the incidents in summer and fall 2020, it seems this has mostly helped.
But it's hard to give tuning advice based on the test results. The values (32, 64) produce a 10% better result than the current hardcoded values (8, 16). In the generic case this is not what someone should tune first.
*** Configurable caches of MultiXacts.
The tests were specifically designed to beat caches. So, according to the tests, the bigger the cache is, the more time it takes to accomplish the test [2].
Anyway, the cache is local to a backend, and its purpose is deduplication of written MultiXacts, not enhancing reads.
*** Using advantage of SimpleLruReadPage_ReadOnly() in MultiXacts.
This simply aligns MultiXact with subtransactions and CLOG: those SLRUs already take advantage of reading pages with a shared lock.
On synthetic tests without background selects this patch adds another ~4.7% of performance [3] against [4]. This improvement seems consistent between different parameter values, yet it is within measurement deviation (see the difference between the warmup run [5] and the closing run [6]).
All in all, these attempts to measure the impact are hand-wavy too. But it makes sense to use a consistent approach among similar subsystems (MultiXacts, subtrans, CLOG etc).
*** Reduce sleep in GetMultiXactIdMembers() on standby.
The problem with pg_usleep(1000L) within GetMultiXactIdMembers() manifests on standbys during contention on MultiXactOffsetControlLock. It's even harder to reproduce.
Yet it seems obvious that reducing the sleep to a shorter time frame will reduce the number of sleeping backends.
For consistency I've returned the patch with SLRU buffer configs to the patchset (the other patches are intact). But I'm mostly concerned about patches 1 and 3.
Thanks!
Best regards, Andrey Borodin.
[0]: https://github.com/x4m/multixact_stress
[1]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L22-L39
[2]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L83-L99
[3]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L9
[4]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L29
[5]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L3
[6]: https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L19
Attachments:
Attachment: v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (application/octet-stream; x-unix-mode=0644)
From bcbd4f6d88d7b6dd9cf779b62e0e0261edadc71b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 27 Oct 2020 19:29:56 +0300
Subject: [PATCH v6 1/4] Use shared lock in GetMultiXactIdMembers for offsets
and members
Previously, reading a multixact required the exclusive control lock for both
the offsets and members SLRUs. This could lead to excessive lock contention.
This commit makes the multixact SLRUs take advantage of SimpleLruReadPage_ReadOnly()
similarly to clog, commit_ts and subtrans.
In order to avoid extra reacquiring of the SLRU control lock, we teach
SimpleLruReadPage_ReadOnly() to take into account the current lock mode and
report the resulting lock mode back.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/clog.c | 4 +++-
src/backend/access/transam/commit_ts.c | 4 +++-
src/backend/access/transam/multixact.c | 22 +++++++++++++++-------
src/backend/access/transam/slru.c | 23 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +++-
src/backend/commands/async.c | 4 +++-
src/backend/storage/lmgr/lwlock.c | 3 +++
src/backend/storage/lmgr/predicate.c | 4 +++-
src/include/access/slru.h | 2 +-
src/include/storage/lwlock.h | 2 ++
10 files changed, 53 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b..4c372065de 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
int lsnindex;
char *byteptr;
XidStatus status;
+ LWLockMode lockmode = LW_NONE;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
+ Assert(lockmode != LW_NONE);
LWLockRelease(XactSLRULock);
return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index cb8a968801..c19b0b5399 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
CommitTimestampEntry entry;
TransactionId oldestCommitTsXid;
TransactionId newestCommitTsXid;
+ LWLockMode lockmode = LW_NONE;
if (!TransactionIdIsValid(xid))
ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
}
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
memcpy(&entry,
CommitTsCtl->shared->page_buffer[slotno] +
SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
+ Assert(lockmode != LW_NONE);
LWLockRelease(CommitTsSLRULock);
return *ts != 0;
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 43653fe572..ccbce90f0e 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLockMode lockmode = LW_NONE;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+ /* lock is acquired by SimpleLruReadPage_ReadOnly */
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ tmpMXact, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1395,14 +1398,14 @@ retry:
length = nextMXOffset - offset;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
*members = ptr;
/* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
+ lockmode = LW_NONE;
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1418,7 +1421,8 @@ retry:
if (pageno != prev_pageno)
{
- slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactMemberCtl, pageno,
+ multi, &lockmode);
prev_pageno = pageno;
}
@@ -1441,6 +1445,7 @@ retry:
truelength++;
}
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactMemberSLRULock);
/*
@@ -2733,6 +2738,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
int entryno;
int slotno;
MultiXactOffset *offptr;
+ LWLockMode lockmode = LW_NONE;
Assert(MultiXactState->finishedStartup);
@@ -2749,10 +2755,12 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
return false;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
+ slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+ multi, &lockmode);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(MultiXactOffsetSLRULock);
*result = offset;
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 16a7898697..ca0a23ab87 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -487,17 +487,23 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
* Return value is the shared-buffer slot number now holding the page.
* The buffer's LRU access info is updated.
*
- * Control lock must NOT be held at entry, but will be held at exit.
- * It is unspecified whether the lock will be shared or exclusive.
+ * Control lock must be held in *lock_mode mode, which may be LW_NONE. Control
+ lock will be held at exit in at least shared mode. The resulting control
+ lock mode is returned in *lock_mode.
*/
int
-SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid,
+ LWLockMode *lock_mode)
{
SlruShared shared = ctl->shared;
int slotno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ if (*lock_mode == LW_NONE)
+ {
+ LWLockAcquire(shared->ControlLock, LW_SHARED);
+ *lock_mode = LW_SHARED;
+ }
/* See if page is already in a buffer */
for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -517,8 +523,13 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ if (*lock_mode != LW_EXCLUSIVE)
+ {
+ Assert(*lock_mode == LW_SHARED);
+ LWLockRelease(shared->ControlLock);
+ LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ *lock_mode = LW_EXCLUSIVE;
+ }
return SimpleLruReadPage(ctl, pageno, true, xid);
}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0111e867c7..7bb1543189 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -113,6 +113,7 @@ SubTransGetParent(TransactionId xid)
int slotno;
TransactionId *ptr;
TransactionId parent;
+ LWLockMode lockmode = LW_NONE;
/* Can't ask about stuff that might not be around anymore */
Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
@@ -123,12 +124,13 @@ SubTransGetParent(TransactionId xid)
/* lock is acquired by SimpleLruReadPage_ReadOnly */
- slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid);
+ slotno = SimpleLruReadPage_ReadOnly(SubTransCtl, pageno, xid, &lockmode);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
ptr += entryno;
parent = *ptr;
+ Assert(lockmode != LW_NONE);
LWLockRelease(SubtransSLRULock);
return parent;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dbcace3f9..8b41957559 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2002,6 +2002,7 @@ asyncQueueReadAllNotifications(void)
int curoffset = QUEUE_POS_OFFSET(pos);
int slotno;
int copysize;
+ LWLockMode lockmode = LW_NONE;
/*
* We copy the data from SLRU into a local buffer, so as to avoid
@@ -2010,7 +2011,7 @@ asyncQueueReadAllNotifications(void)
* part of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
+ InvalidTransactionId, &lockmode);
if (curpage == QUEUE_POS_PAGE(head))
{
/* we only want to read as far as head */
@@ -2027,6 +2028,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ Assert(lockmode != LW_NONE);
LWLockRelease(NotifySLRULock);
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..8a7726f11c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1068,6 +1068,8 @@ LWLockWakeup(LWLock *lock)
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
+ Assert(mode != LW_NONE);
+
/*
* If we don't have a PGPROC structure, there's no way to wait. This
* should never occur, since MyProc should only be null during shared
@@ -1828,6 +1830,7 @@ LWLockRelease(LWLock *lock)
elog(ERROR, "lock %s is not held", T_NAME(lock));
mode = held_lwlocks[i].mode;
+ Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c..a4df90a8ae 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -924,6 +924,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
TransactionId tailXid;
SerCommitSeqNo val;
int slotno;
+ LWLockMode lockmode = LW_NONE;
Assert(TransactionIdIsValid(xid));
@@ -946,8 +947,9 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
- SerialPage(xid), xid);
+ SerialPage(xid), xid, &lockmode);
val = SerialValue(slotno, xid);
+ Assert(lockmode != LW_NONE);
LWLockRelease(SerialSLRULock);
return val;
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b39b43504d..4b66d3b592 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -142,7 +142,7 @@ extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
- TransactionId xid);
+ TransactionId xid, LWLockMode *lock_mode);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..e680c6397a 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
typedef enum LWLockMode
{
+ LW_NONE, /* Not a lock mode. Indicates that there is no
+ * lock. */
LW_EXCLUSIVE,
LW_SHARED,
LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode,
--
2.24.3 (Apple Git-128)
v6-0004-Add-GUCs-to-tune-MultiXact-SLRUs.patch
From a5e15ae42bd6c99d39bac7a9988f586c48ceba13 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 9 May 2020 16:42:07 +0500
Subject: [PATCH v6 4/4] Add GUCs to tune MultiXact SLRUs
---
doc/src/sgml/config.sgml | 31 ++++++++++++++++++++++++++
src/backend/access/transam/multixact.c | 8 +++----
src/backend/utils/init/globals.c | 2 ++
src/backend/utils/misc/guc.c | 22 ++++++++++++++++++
src/include/access/multixact.h | 4 ----
src/include/miscadmin.h | 2 ++
6 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fd4ca29347..d2ca4934de 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1838,6 +1838,37 @@ include_dir 'conf.d'
</para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-multixact-offsets-slru-buffers" xreflabel="multixact_offsets_slru_buffers">
+ <term><varname>multixact_offsets_slru_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_slru_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+ are used to store information about the offsets of multiple row lockers (caused by SELECT FOR UPDATE and other row-locking commands).
+ It defaults to 64 kilobytes (<literal>64KB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-slru-buffers" xreflabel="multixact_members_slru_buffers">
+ <term><varname>multixact_members_slru_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_slru_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of memory to be used for MultiXact members. MultiXact members
+ are used to store information about the XIDs of multiple row lockers. Typically <varname>multixact_members_slru_buffers</varname>
+ is twice as large as <varname>multixact_offsets_slru_buffers</varname>.
+ It defaults to 128 kilobytes (<literal>128KB</literal>).
+ </para>
+ </listitem>
+ </varlistentry>
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 70d977cba3..1c177242c5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1866,8 +1866,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_slru_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_slru_buffers, 0));
return size;
}
@@ -1883,12 +1883,12 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_slru_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_members_slru_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 9ca71933dc..a5ec7bfe88 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -151,3 +151,5 @@ bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
int multixact_local_cache_entries = 256;
+int multixact_offsets_slru_buffers = 8;
+int multixact_members_slru_buffers = 16;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d667578cc3..7682c8be1e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2267,6 +2267,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for MultiXact offsets SLRU."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_slru_buffers,
+ 8, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for MultiXact members SLRU."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_slru_buffers,
+ 16, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 9a30380901..630ceaea4d 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -29,10 +29,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 01af61c963..ef8abea84d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -161,6 +161,8 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_slru_buffers;
+extern PGDLLIMPORT int multixact_members_slru_buffers;
extern PGDLLIMPORT int multixact_local_cache_entries;
--
2.24.3 (Apple Git-128)
v6-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch
From d25e7d68a7979fc391c5d988b256a49cdd916bfb Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 01:55:31 +0300
Subject: [PATCH v6 3/4] Add conditional variable to wait for next MultXact
offset in corner case
GetMultiXactIdMembers() has a corner case when the next multixact offset is
not yet set. In this case GetMultiXactIdMembers() has to sleep until this offset
is set. Currently the sleeping is implemented in a naive way using pg_usleep()
and retry. This commit implements sleeping on a condition variable, which
provides a more efficient way of waiting for the event.
Discussion: https://postgr.es/m/a7f1c4e1-1015-92a4-2bd4-6736bd13d03e%40postgrespro.ru#c496c4e75fc0605094a0e1f763e6a6ec
Author: Andrey Borodin
Reviewed-by: Kyotaro Horiguchi, Daniel Gustafsson
Reviewed-by: Anastasia Lubennikova, Alexander Korotkov
---
src/backend/access/transam/multixact.c | 35 ++++++++++++++++++++++++--
src/backend/postmaster/pgstat.c | 2 ++
src/include/pgstat.h | 1 +
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57be24c0cc..70d977cba3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
#include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -233,6 +234,13 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
+ /*
+ * Conditional variable for waiting till the filling of the next multixact
+ * will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact()
+ * for details.
+ */
+ ConditionVariable nextoffCV;
+
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -892,6 +900,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Let everybody know the offset of this mxid is recorded now. The
+ * waiters are waiting for the offset of the mxid following the target in
+ * order to learn the number of members of the target mxid, so we don't
+ * need to wait until the members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoffCV);
+
LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
prev_pageno = -1;
@@ -1389,9 +1405,23 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just about to write the
+ * offset. Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoffCV);
+
+ /*
+ * We don't have to recheck if multixact was filled in during
+ * ConditionVariablePrepareToSleep(), because we were holding
+ * MultiXactOffsetSLRULock.
+ */
LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoffCV,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
goto retry;
}
@@ -1873,6 +1903,7 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoffCV);
}
else
Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 822f0ebc62..b99398a97e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4020,6 +4020,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "MultiXactWaitNextMembers";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a821ff4f15..75ede141ab 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -952,6 +952,7 @@ typedef enum
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
--
2.24.3 (Apple Git-128)
v6-0002-Make-MultiXact-local-cache-size-configurable.patch
From d0cd1ce556ad3f4e7bbebf0468dc6e00765d79ae Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Oct 2020 03:44:50 +0300
Subject: [PATCH v6 2/4] Make MultiXact local cache size configurable
---
doc/src/sgml/config.sgml | 16 ++++++++++++++++
src/backend/access/transam/multixact.c | 2 +-
src/backend/utils/init/globals.c | 2 ++
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/miscadmin.h | 2 ++
5 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..fd4ca29347 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1823,6 +1823,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-local-cache-entries" xreflabel="multixact_local_cache_entries">
+ <term><varname>multixact_local_cache_entries</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_local_cache_entries</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number cached MultiXact by backend. Any SLRU lookup is preceded
+ by a cache lookup. Higher values help to deduplicate lock sets, while
+ lower values make cache lookups faster.
+ It defaults to 256.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ccbce90f0e..57be24c0cc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1613,7 +1613,7 @@ mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
dlist_push_head(&MXactCache, &entry->node);
- if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
+ if (MXactCacheMembers++ >= multixact_local_cache_entries)
{
dlist_node *node;
mXactCacheEnt *entry;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..9ca71933dc 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,3 +149,5 @@ int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
double vacuum_cleanup_index_scale_factor;
+
+int multixact_local_cache_entries = 256;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..d667578cc3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_local_cache_entries", PGC_SUSET, RESOURCES_MEM,
+ gettext_noop("Sets the number of cached MultiXact by backend."),
+ NULL
+ },
+ &multixact_local_cache_entries,
+ 256, 2, INT_MAX / 2,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..01af61c963 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,6 +162,8 @@ extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_local_cache_entries;
+
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
extern PGDLLIMPORT TimestampTz MyStartTimestamp;
--
2.24.3 (Apple Git-128)
Hi, Tomas!
Thank you for your review.
On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I did a quick review on this patch series. A couple comments:
0001
----
This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.
Yes, but this is not merely to allow callers to do an Assert().
Sometimes in multixacts it can save us some relocks: we can
skip re-acquiring the lock in exclusive mode if we already hold it exclusively.
Adding an Assert() to every caller is probably overkill.
IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)
Having just the flag is exactly what the original version by Andrey
did. But if we have to read two multixact offsets pages or multiple
members pages in one GetMultiXactIdMembers(), then it re-locks
from exclusive mode to exclusive mode. I decided that once we set out
to optimize these locks, this situation is worth avoiding.
In any case, it'd make the lwlock.c changes unnecessary, I think.
I agree that it would be better not to touch lwlock.c. But I didn't
find a way to avoid re-locking from exclusive mode to exclusive mode without
touching lwlock.c or making the code cumbersome in other places.
0002
----
Specifies the number cached MultiXact by backend. Any SLRU lookup ...
should be 'number of cached ...'
Sounds reasonable.
0003
----
* Conditional variable for waiting till the filling of the next multixact
* will be finished. See GetMultiXactIdMembers() and RecordNewMultiXact()
* for details.
Perhaps 'till the next multixact is filled' or 'gets full' would be
better. Not sure.
Sounds reasonable as well.
------
Regards,
Alexander Korotkov
On Wed, Oct 28, 2020 at 10:36:39PM +0300, Alexander Korotkov wrote:
Hi, Tomas!
Thank you for your review.
On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I did a quick review on this patch series. A couple comments:
0001
----
This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.
Yes, but this is not merely to allow callers to do an Assert().
Sometimes in multixacts it can save us some relocks: we can
skip re-acquiring the lock in exclusive mode if we already hold it exclusively.
Adding an Assert() to every caller is probably overkill.
Hmm, OK. That can only happen in GetMultiXactIdMembers, which is the
only place where we do retry, right? Do we actually know this makes any
measurable difference? It seems we're mostly imagining that it might
help, but we don't have any actual proof of that (e.g. a workload which
we might benchmark). Or am I missing something?
For me, the extra conditions make it way harder to reason about the
behavior of the code, and I can't convince myself it's worth it.
IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)
Having just the flag is exactly what the original version by Andrey
did. But if we have to read two multixact offsets pages or multiple
members pages in one GetMultiXactIdMembers(), then it re-locks
from exclusive mode to exclusive mode. I decided that once we set out
to optimize these locks, this situation is worth avoiding.
In any case, it'd make the lwlock.c changes unnecessary, I think.
I agree that it would be better not to touch lwlock.c. But I didn't
find a way to avoid re-locking from exclusive mode to exclusive mode without
touching lwlock.c or making the code cumbersome in other places.
Hmm. OK.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On Wed, Oct 28, 2020 at 12:34:58PM +0500, Andrey Borodin wrote:
Tomas, thanks for looking into this!
On 28 Oct 2020, at 06:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?
If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?
All patches in this thread aim at the same goal: improving performance in the presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem; however, I did some MultiXact stressing here [0]. It's a clumsy test program, because it is still not clear to me which workload parameters trigger MultiXact lock contention. In the generic case I was encountering other locks like *GenLock: XidGenLock, MultiXactGenLock etc. Yet our production system has encountered this problem approximately once a month throughout this year.
The test program locks FOR SHARE different sets of tuples in the presence of concurrent full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this bit set to 0, we lock it.
I've been measuring the time to lock all rows, 3 times for each of the 14 bits, observing the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not contain enough MultiXact locks, I tuned the parameters further (number of concurrent clients, number of bits, select queries, etc.).
Why is it so complicated? It seems that other reproductions of the problem were encountering other locks.
It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...
I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:
(a) it actually will help with the issue you're observing on production
and
(b) it's actually worth the extra complexity (e.g. the lwlock changes)
I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 Oct 2020, at 04:32, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production
and
(b) it's actually worth the extra complexity (e.g. the lwlock changes)

I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.
Thanks, Tomas! This totally makes sense.
Indeed, collecting queries has not helped yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries. But the problem does not manifest there.
Why do I think the problem is in MultiXacts?
Here is a chart with number of wait events on each host
During the problem, MultiXactOffsetControlLock and SLRURead dominate all other wait event types. After the primary switched over to another node, SLRURead continued there for a bit, then disappeared.
Backtraces on standbys taken during the problem show that most backends are sleeping in pg_sleep(1000L) and so are not included in the wait stats on these charts.
Currently I'm considering writing a test that directly calls MultiXactIdExpand(), MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. Do you think benchmarks in such tests would be meaningful?
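A micro-benchmark extension of that shape might look roughly like this. This is only a sketch: the function name bench_multixact, the choice of status values, and the loop are my invention, written against the PG13-era signatures of these internal functions and built with PGXS; asserts in multixact.c may constrain which xid/status combinations are legal, so the member arguments would need adjusting against the real code.

```c
/* Hypothetical SQL-callable benchmark; not the actual proposed test.
 * CREATE FUNCTION bench_multixact(int) RETURNS float8
 *   AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
 */
#include "postgres.h"
#include "fmgr.h"
#include "access/multixact.h"
#include "access/xact.h"
#include "portability/instr_time.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(bench_multixact);

Datum
bench_multixact(PG_FUNCTION_ARGS)
{
	int32		iters = PG_GETARG_INT32(0);
	TransactionId xid = GetCurrentTransactionId();
	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);
	for (int i = 0; i < iters; i++)
	{
		MultiXactMember *members;
		int			nmembers;

		/* create a two-member multi, extend it, then read it back;
		 * statuses are chosen only for illustration */
		MultiXactId multi = MultiXactIdCreate(xid, MultiXactStatusForKeyShare,
											  xid, MultiXactStatusForShare);

		multi = MultiXactIdExpand(multi, xid, MultiXactStatusForShare);
		nmembers = GetMultiXactIdMembers(multi, &members, false, true);
		if (nmembers > 0)
			pfree(members);
	}
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	PG_RETURN_FLOAT8(INSTR_TIME_GET_MILLISEC(duration));
}
```

Running this from many concurrent backends would exercise the MultiXact offsets and members SLRUs directly, though it bypasses the heap-tuple locking path, so it may not reproduce the exact production lock mix.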
Thanks!
Best regards, Andrey Borodin.
Attachments:
Графика-1.png (image/png): chart of the number of wait events on each host