SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.
A couple of patches have been proposed in the past to address this by
increasing the buffer pool size; one of them [1], proposed by Thomas
Munro, makes the size of the buffer pool configurable. And, in order
to deal with the linear search in a large buffer pool, it divides the
SLRU buffer pool into associative banks, so that searching in the
buffer pool is not affected by the large size of the pool. This does
well for workloads that are mainly impacted by frequent buffer
replacement, but it still doesn't help workloads where the centralized
control lock is the bottleneck.
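As a rough sketch of the associative-bank idea (hypothetical helper
names; the actual patch uses ctl->bank_mask and SLRU_BANK_SIZE), a
pageno is hashed to one bank and only that bank's few slots are
scanned, so lookup cost stays constant as the pool grows:

```c
#include <assert.h>

#define SLRU_BANK_SIZE 8            /* buffers per bank, as in the patch */

/* Map a pageno to the first slot of its bank; bank_mask = nbanks - 1
 * (nbanks is a power of two), so adjacent pagenos land in different banks. */
static int
bank_start(int pageno, int bank_mask)
{
    int bankno = pageno & bank_mask;

    return bankno * SLRU_BANK_SIZE;
}

/* Search only the 8 slots of the page's bank instead of the whole pool. */
static int
find_slot(const int *page_number, int pageno, int bank_mask)
{
    int start = bank_start(pageno, bank_mask);

    for (int slotno = start; slotno < start + SLRU_BANK_SIZE; slotno++)
        if (page_number[slotno] == pageno)
            return slotno;
    return -1;                      /* not cached in this bank */
}
```

With 32 slots (4 banks, bank_mask = 3), pageno 10 maps to bank 2, i.e.
slots 16..23, and a miss only costs scanning those 8 slots.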
So I have taken this patch as my base patch (v1-0001) and added 2
more improvements on top of it: 1) In v1-0002, instead of a
centralized control lock for the SLRU, I have introduced a bank-wise
control lock. 2) In v1-0003, I have removed the global LRU counter and
introduced a bank-wise counter. The second change (v1-0003) avoids
CPU/OS cache invalidation due to frequent updates of a single
variable; later, in my performance tests, I will show how much gain we
get from these 2 changes.
Note: This is going to be a long email, but I have summarised the
main idea above this point; below, I discuss more internal details in
order to show that the design idea is valid, and present 2 performance
tests, one specific to contention on the centralized lock and the
other mainly showing contention due to frequent buffer replacement in
the SLRU buffer pool. We are getting ~2x TPS compared to head with
these patches, and in later sections I discuss this in more detail,
i.e. the exact performance numbers and an analysis of why we are
seeing the gain.
I faced some problems while converting this centralized control lock
to a bank-wise lock, mainly because this lock is (mis)used for
different purposes. The main purpose of this control lock, as I
understand it, is to protect in-memory access (read/write) of the
buffers in the SLRU buffer pool.
Here is the list of some problems and their analysis:
1) In some SLRUs, we use this lock to protect members of the control
structure that are specific to that SLRU layer, e.g. the
SerialControlData members are protected by SerialSLRULock. I don't
think that is the right use of this lock, so I have introduced another
lock, called SerialControlLock, for this specific purpose. Based on my
analysis, there is no reason to protect these members and the SLRU
buffer access with the same lock.
2) The member 'latest_page_number' inside SlruSharedData is also
protected by the SLRULock. I would not say this use case is wrong, but
since this is a common variable and not a per-bank variable, it can no
longer be protected by a bank-wise lock. However, this variable is
only used to track the latest page in an SLRU so that we do not evict
the latest page during victim page selection. So I have converted it
to an atomic variable, as it is completely independent of the SLRU
buffer access.
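A minimal sketch of that idea using C11 atomics (the patch uses
PostgreSQL's pg_atomic_* API; the helper name here is hypothetical):
since the latest page number only ever advances, a compare-and-swap
loop that keeps the maximum is enough, with no SLRU lock involved.

```c
#include <assert.h>
#include <stdatomic.h>

/* Advance latest_page_number monotonically; concurrent writers may race,
 * but the value can only move forward, never backward. */
static void
set_latest_page_number(_Atomic int *latest, int pageno)
{
    int cur = atomic_load(latest);

    while (pageno > cur &&
           !atomic_compare_exchange_weak(latest, &cur, pageno))
        ;                           /* cur is reloaded on failure; retry */
}
```

A stale writer losing the race simply sees a newer value and gives up,
which is exactly the behavior the victim-selection code needs.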
3) SlruScanDirectory(): when called from DeactivateCommitTs() it is
invoked under the SLRU control lock, but from all other places
SlruScanDirectory() is called without the lock, because those callers
run in contexts that are not executed concurrently (i.e. startup,
checkpoint). DeactivateCommitTs() is also called only during startup,
so there doesn't seem to be any need to call this under the SLRU
control lock. Currently, I call it under an all-bank lock, because
logically this is not a performance path, and that way we keep it
consistent with the current logic; but if others also think that we do
not need a lock here, then we can remove it, and then we don't need
the all-bank lock anywhere.
There are some other uses of this lock where one might think that
dividing it into bank-wise locks would be a problem, but it is not,
and I have given my analysis for each:
1) SimpleLruTruncate(): We might worry that with a bank-wise lock we
would need to release and acquire different locks as we scan different
banks. But as per my analysis this is not an issue, because a) the
current code also releases and reacquires the centralized lock
multiple times in order to perform I/O on a buffer slot, so the
behavior is not changed; and, more importantly, b) all SLRU layers
take care that this function is not accessed concurrently. I have
verified all callers of this function and this is true, and the
function's header comment says the same. So this is not an issue as
per my analysis.
2) Extending or adding a new page to an SLRU: I have noticed that
this is also protected either by some other exclusive lock or is only
done during startup. So in short, the SLRULock is just used to protect
access to the buffers in the buffer pool; it is not there to guarantee
exclusive execution of the function itself, because that is taken care
of in some other way.
3) Another thing I noticed while writing this, which seems worth
noting: the CLOG group update of xid status. There, if we do not get
the control lock on the SLRU, we add ourselves to a group and the
group leader does the job for all members of the group. One might
think that different pages in the group might belong to different SLRU
banks, so the leader would need to acquire/release the lock as it
processes each request in the group. That is true, and it is taken
care of, but we don't need to worry about this case: as per the
implementation of the group update, we try to put members with the
same page request into one group, and only as an exception can there
be members with different page requests. So with a bank-wise lock we
handle that exception case, but acquiring/releasing multiple times is
not the regular case. Design-wise we are good, and performance-wise
there should not be any problem, because most of the time we will be
updating pages from the same bank, and if in some cases we have
updates for old transactions due to long-running transactions, then we
should do better by not having a centralized lock.
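To illustrate why the common case adds no extra lock traffic, here is
a toy model (not the actual clog code; names are hypothetical) of a
leader walking a group of page requests and switching the bank lock
only when the next member's page falls in a different bank:

```c
#include <assert.h>

/* Count how many bank-lock acquisitions a leader would make for a list
 * of page requests, re-acquiring only when the bank actually changes. */
static int
count_lock_switches(const int *pagenos, int n, int bank_mask)
{
    int acquisitions = 0;
    int held_bank = -1;             /* no bank lock held yet */

    for (int i = 0; i < n; i++)
    {
        int bankno = pagenos[i] & bank_mask;

        if (bankno != held_bank)
        {
            /* release held_bank's lock, acquire bankno's lock */
            acquisitions++;
            held_bank = bankno;
        }
        /* ... update the xid status on this member's page here ... */
    }
    return acquisitions;
}
```

In the common case, where all members request the same page, the
leader takes exactly one bank lock; only the exceptional mixed-page
group pays for multiple acquisitions.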
Performance Test:
Exp1: Shows the problem of CPU/OS cache invalidation caused by
frequent updates of the centralized lock and the common LRU counter.
Here, in parallel with the pgbench script, we run a transaction that
frequently creates subtransaction overflow, which forces the
visibility-check mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200
Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres
savepoint.sql (creates subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1);
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;
Code under test:
Head: PostgreSQL head code
SlruBank: the first patch applied, converting the SLRU buffer pool
into banks (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003
Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731
So we can see that at 128 clients we get ~2x TPS (with SlruBank +
bank-wise lock and bank-wise LRU counter) compared to HEAD. One might
wonder why we do not see much gain with the SlruBank patch alone. The
reason is that in this particular test case there is not much load on
buffer replacement; in fact, the wait events do not show contention on
any lock either. The main load comes from frequently modifying shared
variables like the centralized control lock and the centralized LRU
counter, as is evident in the perf data shown below
+ 74.72% 0.06% postgres postgres [.] XidInMVCCSnapshot
+ 74.08% 0.02% postgres postgres [.] SubTransGetTopmostTransaction
+ 74.04% 0.07% postgres postgres [.] SubTransGetParent
+ 57.66% 0.04% postgres postgres [.] LWLockAcquire
+ 57.64% 0.26% postgres postgres [.] SimpleLruReadPage_ReadOnly
……
+ 16.53% 0.07% postgres postgres [.] LWLockRelease
+ 16.36% 0.04% postgres postgres [.] pg_atomic_sub_fetch_u32
+ 16.31% 16.24% postgres postgres [.] pg_atomic_fetch_sub_u32_impl
We can notice that the main load is on the atomic variable within the
LWLockAcquire and LWLockRelease. Once we apply the bankwise lock
patch (v1-0002), the same problem is visible on the cur_lru_count
update in the SlruRecentlyUsed macro [2] (I have not shown that here,
but it was visible in my perf report). And that is resolved by
implementing a bank-wise counter.
[2]:
#define SlruRecentlyUsed(shared, slotno) \
do { \
..
(shared)->cur_lru_count = ++new_lru_count; \
..
} \
} while (0)
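The bank-wise replacement for that counter (v1-0003) can be sketched
as a plain C function mirroring the macro in the patch; only the
page's own bank counter is bumped, so a hot page in one bank no longer
dirties the cache line every other backend reads:

```c
#include <assert.h>

#define SLRU_BANK_SIZE 8

/* Per-bank variant of SlruRecentlyUsed: bump only the counter of the
 * bank that owns slotno, and suppress useless increments for repeated
 * hits on the same page (as the original macro's if-test does). */
static void
slru_recently_used(int *bank_cur_lru_count, int *page_lru_count, int slotno)
{
    int bankno = slotno / SLRU_BANK_SIZE;
    int new_lru_count = bank_cur_lru_count[bankno];

    if (new_lru_count != page_lru_count[slotno])
    {
        bank_cur_lru_count[bankno] = ++new_lru_count;
        page_lru_count[slotno] = new_lru_count;
    }
}
```

Repeated accesses to the same page leave the counter untouched, and
accesses in different banks never touch each other's counters.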
Exp2: This test shows the load of frequent SLRU buffer replacement.
In this test, we run a pgbench-like script which frequently generates
multixact-ids, and in parallel we repeatedly start and commit a
long-running transaction so that the multixact-ids are not immediately
cleaned up by the vacuum and we create contention on the SLRU buffer
pool. I am not leaving the long-running transaction open forever, as
that would start to show another problem with respect to bloat, and we
would lose the purpose of what I am trying to show here.
Note: the test configuration is the same as in Exp1; only the
workload is different, we run the 2 scripts below. The new config
parameter (added in v1-0001) is set to slru_buffers_size_scale=4,
which means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (it is 16 on head)
and NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 on head).
./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres
cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF
cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF
Results:
Clients Head SlruBank SlruBank+BankwiseLock
1 528 513 531
8 3870 4239 4157
32 13945 14470 14556
64 10086 19034 24482
128 6909 15627 18161
Here we can see a good improvement with the SlruBank patch alone,
because it increases the SLRU buffer pool, and in this workload there
is a lot of contention due to buffer replacement. As shown below,
there is a lot of load on MultiXactOffsetSLRU as well as on
MultiXactOffsetBuffer, which shows that buffer evictions are frequent
in this workload. Increasing the SLRU buffer pool size helps a lot,
and dividing the SLRU lock into bank-wise locks gives a further gain.
In total, we see ~2.5x TPS at 64 and 128 clients compared to head.
3401 LWLock | MultiXactOffsetSLRU
2031 LWLock | MultiXactOffsetBuffer
687 |
427 LWLock | BufferContent
Credits:
- The base patch v1-0001 is authored by Thomas Munro; I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas
from Robert and myself.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0003-Introduce-bank-wise-LRU-counter.patch
From 2fe09c749e7fbca1998f7964ab8341df466023c3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 11 Oct 2023 15:41:34 +0530
Subject: [PATCH v1 3/3] Introduce bank-wise LRU counter
Since we have already divided the buffer pool into banks and the
victim buffer search is also done at the bank level, there is no need
to have a centralized lru counter. This will also improve performance
by reducing frequent cpu cache invalidation, since we no longer update
a common variable.

Dilip Kumar, based on a design idea from Robert Haas
---
src/backend/access/transam/slru.c | 23 +++++++++++++++--------
src/include/access/slru.h | 28 +++++++++++++++++-----------
2 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index c06e4eddd1..fd44ad7d47 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -110,13 +110,13 @@ typedef struct SlruWriteAllData *SlruWriteAll;
*
* The reason for the if-test is that there are often many consecutive
* accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
+ * useless increments of bank_cur_lru_count, we reduce the probability that old
* pages' counts will "wrap around" and make them appear recently used.
*
* We allow this code to be executed concurrently by multiple processes within
* SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
* this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
+ * However, it is possible for either bank_cur_lru_count or individual
* page_lru_count entries to be "reset" to lower values than they should have,
* in case a process is delayed while it executes this macro. With care in
* SlruSelectLRUPage(), this does little harm, and in any case the absolute
@@ -125,9 +125,10 @@ typedef struct SlruWriteAllData *SlruWriteAll;
*/
#define SlruRecentlyUsed(shared, slotno) \
do { \
- int new_lru_count = (shared)->cur_lru_count; \
+ int bankno = slotno / SLRU_BANK_SIZE; \
+ int new_lru_count = (shared)->bank_cur_lru_count[bankno]; \
if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
+ (shared)->bank_cur_lru_count[bankno] = ++new_lru_count; \
(shared)->page_lru_count[slotno] = new_lru_count; \
} \
} while (0)
@@ -200,6 +201,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
sz += MAXALIGN((bankmask + 1) * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN((bankmask + 1) * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -276,8 +278,6 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -300,6 +300,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
shared->bank_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
+ shared->bank_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(int));
if (nlsns > 0)
{
@@ -321,8 +323,11 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
}
/* initialize bank locks for each buffer bank */
for (bankno = 0; bankno < nbanks; bankno++)
+ {
LWLockInitialize(&shared->bank_locks[bankno].lock,
slru_tranche_id);
+ shared->bank_cur_lru_count[bankno] = 0;
+ }
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -1112,9 +1117,11 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
+
for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
@@ -1149,7 +1156,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
+ cur_count = (shared->bank_cur_lru_count[bankno])++;
for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index eec7a568dc..fea12cdfb3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -73,6 +73,23 @@ typedef struct SlruSharedData
*/
LWLockPadded *bank_locks;
+ /*----------
+ * Instead of global counter we maintain a bank-wise lru counter because
+ * a) we are doing the victim buffer selection as bank level so there is
+ * no point of having a global counter b) manipulating a global counter
+ * will have frequent cpu cache invalidation and that will affect the
+ * performance.
+ *
+ * We mark a page "most recently used" by setting
+ * page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
+ * The oldest page is therefore the one with the highest value of
+ * bank_cur_lru_count[bankno] - page_lru_count[slotno]
+ * The counts will eventually wrap around, but this calculation still
+ * works as long as no page's age exceeds INT_MAX counts.
+ *----------
+ */
+ int *bank_cur_lru_count;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -84,17 +101,6 @@ typedef struct SlruSharedData
XLogRecPtr *group_lsn;
int lsn_groups_per_page;
- /*----------
- * We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
- * The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
- * The counts will eventually wrap around, but this calculation still
- * works as long as no page's age exceeds INT_MAX counts.
- *----------
- */
- int cur_lru_count;
-
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
--
2.39.2 (Apple Git-143)
v1-0001-Divide-SLRU-buffers-into-banks.patch
From 0d05d2a043ab393df797ba2ab67d8471398a9260 Mon Sep 17 00:00:00 2001
From: Dilip kumar <dilipkumar@dkmac.local>
Date: Fri, 8 Sep 2023 15:08:32 +0530
Subject: [PATCH v1 1/3] Divide SLRU buffers into banks
We want to eliminate linear search within SLRU buffers.
To do so we divide SLRU buffers into banks. Each bank holds
approximately 8 buffers. Each SLRU pageno may reside only in one bank.
Adjacent pagenos reside in different banks.
Also invent slru_buffers_size_scale to control SLRU buffers.
patch by Thomas Munro
---
doc/src/sgml/config.sgml | 31 +++++++++++
src/backend/access/transam/clog.c | 28 ++--------
src/backend/access/transam/commit_ts.c | 19 ++-----
src/backend/access/transam/slru.c | 54 +++++++++++++++++--
src/backend/access/transam/subtrans.c | 1 +
src/backend/utils/init/globals.c | 2 +
src/backend/utils/misc/guc_tables.c | 10 ++++
src/backend/utils/misc/postgresql.conf.sample | 3 ++
src/include/access/multixact.h | 4 +-
src/include/access/slru.h | 5 ++
src/include/access/subtrans.h | 2 +-
src/include/commands/async.h | 2 +-
src/include/miscadmin.h | 2 +
src/include/storage/predicate.h | 2 +-
14 files changed, 117 insertions(+), 48 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 924309af26..416d979b54 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,37 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-slru-buffers-size-scale" xreflabel="slru_buffers_size_scale">
+ <term><varname>slru_buffers_size_scale</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>slru_buffers_size_scale</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies power 2 scale for all SLRU shared memory buffers sizes. Buffers sizes depends on
+ both <literal>guc_slru_buffers_size_scale</literal> and <literal>shared_buffers</literal> params.
+ </para>
+ <para>
+ This affects on buffers in the list below (see also <xref linkend="pgdata-contents-table"/>):
+ <itemizedlist>
+ <listitem><para><literal>NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Value is in <literal>0..7</literal> bounds.
+ The default value is <literal>2</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..29d58f1eb3 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -74,6 +74,8 @@
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
+#define NUM_CLOG_BUFFERS (128 << slru_buffers_size_scale)
+
/*
* The number of subtransactions below which we consider to apply clog group
* update optimization. Testing reveals that the number higher than this can
@@ -660,42 +662,20 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
return status;
}
-/*
- * Number of shared CLOG buffers.
- *
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
- */
-Size
-CLOGShmemBuffers(void)
-{
- return Min(128, Max(4, NBuffers / 512));
-}
-
/*
* Initialization of shared memory for CLOG
*/
Size
CLOGShmemSize(void)
{
- return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE);
+ return SimpleLruShmemSize(NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE);
}
void
CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
- SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
+ SimpleLruInit(XactCtl, "Xact", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..54422f2780 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -70,6 +70,8 @@ typedef struct CommitTimestampEntry
#define TransactionIdToCTsEntry(xid) \
((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+#define NUM_COMMIT_TS_BUFFERS (128 << slru_buffers_size_scale)
+
/*
* Link to shared-memory data structures for CommitTs control
*/
@@ -487,26 +489,13 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
}
-/*
- * Number of shared CommitTS buffers.
- *
- * We use a very similar logic as for the number of CLOG buffers (except we
- * scale up twice as fast with shared buffers, and the maximum is twice as
- * high); see comments in CLOGShmemBuffers.
- */
-Size
-CommitTsShmemBuffers(void)
-{
- return Min(256, Max(4, NBuffers / 256));
-}
-
/*
* Shared memory sizing for CommitTs
*/
Size
CommitTsShmemSize(void)
{
- return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+ return SimpleLruShmemSize(NUM_COMMIT_TS_BUFFERS, 0) +
sizeof(CommitTimestampShared);
}
@@ -520,7 +509,7 @@ CommitTsShmemInit(void)
bool found;
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
- SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
+ SimpleLruInit(CommitTsCtl, "CommitTs", NUM_COMMIT_TS_BUFFERS, 0,
CommitTsSLRULock, "pg_commit_ts",
LWTRANCHE_COMMITTS_BUFFER,
SYNC_HANDLER_COMMIT_TS);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 71ac70fb40..57889b72bd 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "port/pg_bitutils.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -71,6 +72,17 @@
*/
#define MAX_WRITEALL_BUFFERS 16
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
+
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 8
+
typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
@@ -134,7 +146,7 @@ typedef enum
static SlruErrorCause slru_errcause;
static int slru_errno;
-
+static void SlruAdjustNSlots(int *nslots, int *bankmask);
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +160,25 @@ static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+/*
+ * Pick number of slots and bank size optimal for hashed associative SLRU buffers.
+ * We declare SLRU nslots is always power of 2.
+ * We split SLRU to 8-sized hash banks, after some performance benchmarks.
+ * We hash pageno to banks by pageno masked by 3 upper bits.
+ */
+static void
+SlruAdjustNSlots(int *nslots, int *bankmask)
+{
+ Assert(*nslots > 0);
+ Assert(*nslots <= SLRU_MAX_ALLOWED_BUFFERS);
+
+ *nslots = (int) pg_nextpower2_32(Max(SLRU_BANK_SIZE, Min(*nslots, NBuffers / 256)));
+
+ *bankmask = *nslots / SLRU_BANK_SIZE - 1;
+
+ elog(DEBUG5, "nslots %d banksize %d nbanks %d bankmask %x", *nslots, SLRU_BANK_SIZE, *nslots / SLRU_BANK_SIZE, *bankmask);
+}
+
/*
* Initialization of shared memory
*/
@@ -156,6 +187,9 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
+ int bankmask_ignore;
+
+ SlruAdjustNSlots(&nslots, &bankmask_ignore);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -191,6 +225,9 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
{
SlruShared shared;
bool found;
+ int bankmask;
+
+ SlruAdjustNSlots(&nslots, &bankmask);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -258,7 +295,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
else
+ {
Assert(found);
+ Assert(shared->num_slots == nslots);
+ }
/*
* Initialize the unshared control struct, including directory path. We
@@ -266,6 +306,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
*/
ctl->shared = shared;
ctl->sync_handler = sync_handler;
+ ctl->bank_mask = bankmask;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -497,12 +538,14 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
@@ -1031,7 +1074,10 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
+
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
@@ -1066,7 +1112,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
int this_page_number;
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..125273e235 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 011ec18015..61b12d1056 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -154,3 +154,5 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int slru_buffers_size_scale = 2; /* power 2 scale for SLRU buffers */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..4a182225b7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2277,6 +2277,16 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"slru_buffers_size_scale", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("SLRU buffers size scale of power 2"),
+ NULL
+ },
+ &slru_buffers_size_scale,
+ 2, 0, 7,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..136ea5f48c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -157,6 +157,9 @@
# mmap
# (change requires restart)
#min_dynamic_shared_memory = 0MB # (change requires restart)
+#slru_buffers_size_scale = 2 # power-of-2 scale factor for SLRU buffer pools, range 0..7
+ # (change requires restart)
+
#vacuum_buffer_usage_limit = 256kB # size of vacuum and analyze buffer access strategy ring;
# 0 to disable vacuum buffer access strategy;
# range 128kB to 16GB
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 246f757f6a..6a2c914d48 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -30,8 +30,8 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
+#define NUM_MULTIXACTOFFSET_BUFFERS (16 << slru_buffers_size_scale)
+#define NUM_MULTIXACTMEMBER_BUFFERS (32 << slru_buffers_size_scale)
/*
* Possible multixact lock modes ("status"). The first four modes are for
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index a8a424d92d..f5f2b5b8b5 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -134,6 +134,11 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /*
+ * Mask for mapping a page number to its buffer bank.
+ */
+ Size bank_mask;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..0dad287550 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -12,7 +12,7 @@
#define SUBTRANS_H
/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
+#define NUM_SUBTRANS_BUFFERS (32 << slru_buffers_size_scale)
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b1d59472b1 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -18,7 +18,7 @@
/*
* The number of SLRU page buffers we use for the notification queue.
*/
-#define NUM_NOTIFY_BUFFERS 8
+#define NUM_NOTIFY_BUFFERS (16 << slru_buffers_size_scale)
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..f2cec02a2f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,7 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int slru_buffers_size_scale;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
/*
* Upper and lower hard limits for the buffer access strategy ring size
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..794ecd8169 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -28,7 +28,7 @@ extern PGDLLIMPORT int max_predicate_locks_per_page;
/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
+#define NUM_SERIAL_BUFFERS (16 << slru_buffers_size_scale)
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
--
2.39.2 (Apple Git-143)
Attachment: v1-0002-bank-wise-slru-locks.patch (application/octet-stream)
From 4823d95c86ee696b2df57526ffba93aea83054bf Mon Sep 17 00:00:00 2001
From: Dilip kumar <dilipkumar@dkmac.local>
Date: Sat, 9 Sep 2023 12:56:10 +0530
Subject: [PATCH v1 2/3] bank wise slru locks
The previous patch divided the SLRU buffer pool into associative
banks. This patch further optimizes it by introducing bank-wise
SLRU locks in place of the single centralized control lock, which
reduces contention on the SLRU control lock.
Dilip Kumar based on some design suggestions from Robert Haas
---
src/backend/access/transam/clog.c | 108 +++++++++-----
src/backend/access/transam/commit_ts.c | 43 +++---
src/backend/access/transam/multixact.c | 179 ++++++++++++++++-------
src/backend/access/transam/slru.c | 134 +++++++++++++----
src/backend/access/transam/subtrans.c | 27 ++--
src/backend/commands/async.c | 30 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 32 ++--
src/include/access/slru.h | 33 ++++-
src/include/storage/lwlock.h | 8 +
src/test/modules/test_slru/test_slru.c | 28 ++--
12 files changed, 452 insertions(+), 198 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 29d58f1eb3..938806532d 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -276,14 +276,19 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the lock of the SLRU bank that holds the page we are going to access. */
+ lock = SimpleLruPageGetSLRULock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on the SLRU bank lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
+ * updates for multiple backends so that the number of times the SLRU lock
* needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
@@ -302,17 +307,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire the SLRU bank lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -325,10 +330,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -347,7 +352,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruPageGetSLRULock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -398,14 +404,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire the SLRU bank lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * SLRU lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention around
+ * the SLRU bank lock when many processes are trying to commit at once, since the lock
+ * need not be repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -419,6 +424,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -499,11 +506,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
+ * We are the leader, so clear the list of processes waiting for
* group XID status update, saving a pointer to the head of the list.
* Trying to pop elements one at a time could lead to an ABA problem.
*/
@@ -513,10 +517,38 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /* Acquire the SLRU bank lock w.r.t. the first page in the group. */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruPageGetSLRULock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * Although we try our best to keep all members of a group on the same
+ * page, there are cases where the pages can differ; for details, see
+ * the comment in the while loop above where this process is added to
+ * the group update. So if the page we are about to access does not
+ * fall in the same SLRU bank as the last page we updated, we need to
+ * release the lock on the previous bank and acquire the lock on the
+ * bank of the page we are going to update now.
+ */
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruPageGetSLRULock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -536,7 +568,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -565,10 +598,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with slot specific SLRU bank's lock held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -657,7 +691,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(XactCtl, pageno));
return status;
}
@@ -676,7 +710,7 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ "pg_xact", LWTRANCHE_XACT_BUFFER, LWTRANCHE_XACT_SLRU,
SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -691,8 +725,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -701,7 +736,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -736,14 +771,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -754,8 +785,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruPageGetSLRULock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -787,7 +819,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -819,6 +851,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -829,13 +862,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruPageGetSLRULock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -973,16 +1007,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 54422f2780..0c7f5bae86 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -220,8 +220,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruPageGetSLRULock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -231,13 +232,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the lock of the slot's SLRU bank held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -338,7 +339,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -510,9 +511,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", NUM_COMMIT_TS_BUFFERS, 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -668,9 +668,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -697,12 +695,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -751,9 +750,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruReleaseAllBankLock(CommitTsCtl);
}
/*
@@ -785,6 +784,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -805,12 +805,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -964,16 +966,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruPageGetSLRULock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -985,7 +989,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index abb022e067..e63bd4cf71 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -192,10 +192,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the bank-wise SLRU locks of MultiXactOffset
+ * and MultiXactMember to guard accesses to the two sets of SLRU buffers. For
+ * concurrency's sake, we avoid holding more than one of these locks at a
+ * time.)
*/
typedef struct MultiXactStateData
{
@@ -870,12 +870,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -891,10 +894,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -916,6 +917,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU bank; if so, release the old
+ * bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -936,7 +951,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1239,6 +1255,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1342,11 +1360,23 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If the page is in a different SLRU bank, release the lock on the
+ * previous bank (if we are already holding one) and acquire the lock
+ * on the new bank.
+ */
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1379,7 +1409,22 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * The SLRU pageno has changed, so check whether this page falls in a
+ * different SLRU bank than the one whose lock we are already holding;
+ * if so, release the lock on the old bank and acquire the lock on
+ * the new bank.
+ */
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1388,7 +1433,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1397,13 +1443,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1419,6 +1463,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU bank; if so, release the old
+ * bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1442,7 +1500,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1852,15 +1911,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
- SYNC_HANDLER_MULTIXACT_MEMBER);
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU, SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
/* Initialize our shared state struct */
@@ -1894,8 +1952,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1904,9 +1964,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1915,7 +1976,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1975,10 +2036,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1993,7 +2056,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2015,13 +2078,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2037,7 +2102,6 @@ TrimMultiXact(void)
int pageno;
int entryno;
int flagsoff;
-
LWLockAcquire(MultiXactGenLock, LW_SHARED);
nextMXact = MultiXactState->nextMXact;
offset = MultiXactState->nextOffset;
@@ -2046,13 +2110,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2067,7 +2131,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2075,18 +2141,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2098,7 +2163,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2113,10 +2180,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2404,6 +2470,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2414,13 +2481,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2453,15 +2521,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2759,7 +2829,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3241,31 +3311,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3331,7 +3403,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 57889b72bd..c06e4eddd1 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -187,9 +187,9 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
- int bankmask_ignore;
+ int bankmask;
- SlruAdjustNSlots(&nslots, &bankmask_ignore);
+ SlruAdjustNSlots(&nslots, &bankmask);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -199,6 +199,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN((bankmask + 1) * sizeof(LWLockPadded)); /* bank_locks[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -206,6 +207,32 @@ SimpleLruShmemSize(int nslots, int nlsns)
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
+/*
+ * Acquire the locks of all banks of the given SlruCtl
+ */
+void
+SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+
+ for (bankno = 0; bankno <= ctl->bank_mask; bankno++)
+ LWLockAcquire(&shared->bank_locks[bankno].lock, mode);
+}
+
+/*
+ * Release the locks of all banks of the given SlruCtl
+ */
+void
+SimpleLruReleaseAllBankLock(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+
+ for (bankno = 0; bankno <= ctl->bank_mask; bankno++)
+ LWLockRelease(&shared->bank_locks[bankno].lock);
+}
+
/*
* Initialize, or attach to, a simple LRU cache in shared memory.
*
@@ -220,7 +247,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int tranche_id, int slru_tranche_id,
SyncRequestHandler sync_handler)
{
SlruShared shared;
@@ -239,13 +266,13 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int nbanks = bankmask + 1;
+ int bankno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
@@ -271,6 +298,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->bank_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
if (nlsns > 0)
{
@@ -290,6 +319,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->page_lru_count[slotno] = 0;
ptr += BLCKSZ;
}
+ /* initialize bank locks for each buffer bank */
+ for (bankno = 0; bankno < nbanks; bankno++)
+ LWLockInitialize(&shared->bank_locks[bankno].lock,
+ slru_tranche_id);
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -344,7 +377,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -383,12 +416,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int bankno = slotno / SLRU_BANK_SIZE;
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -443,6 +477,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int bankno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -485,9 +520,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ bankno = slotno / SLRU_BANK_SIZE;
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -496,7 +532,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -538,11 +574,12 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -562,8 +599,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -585,6 +622,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int bankno = slotno / SLRU_BANK_SIZE;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -613,7 +651,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -628,7 +666,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1133,7 +1171,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1207,6 +1245,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int lastbankno = 0;
bool ok;
/* update the stats counter of flushes */
@@ -1217,10 +1256,19 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[0].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / SLRU_BANK_SIZE;
+
+ if (curbankno != lastbankno)
+ {
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ lastbankno = curbankno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1234,7 +1282,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
/*
* Now close any files that were open
@@ -1274,6 +1322,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevbankno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1284,25 +1333,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevbankno = 0;
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / SLRU_BANK_SIZE;
+
+ /*
+ * If curbankno differs from prevbankno, release the lock on the
+ * previous bank and acquire the lock on the current bank.
+ */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1332,10 +1394,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1376,15 +1440,31 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevbankno = 0;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curbankno;
+
+ curbankno = slotno / SLRU_BANK_SIZE;
+
+ /*
+ * If curbankno differs from prevbankno, release the lock on the
+ * previous bank and acquire the lock on the current bank.
+ */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1418,7 +1498,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
}
/*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 125273e235..2b0afa8a15 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -77,12 +77,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -100,7 +102,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -130,7 +132,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(SubTransCtl, pageno));
return parent;
}
@@ -193,8 +195,8 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -212,8 +214,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -222,7 +225,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -259,7 +262,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(SubTransCtl, LW_EXCLUSIVE);
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
@@ -275,7 +278,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
}
(void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ SimpleLruReleaseAllBankLock(SubTransCtl);
}
/*
@@ -309,6 +312,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -320,12 +324,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruPageGetSLRULock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index d148d10850..7088fe15ea 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -267,9 +267,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * The SLRU buffer pool is divided into banks, and a bank-wise SLRU lock is
+ * used as the control lock for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -570,8 +571,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
- SYNC_HANDLER_NONE);
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ LWTRANCHE_NOTIFY_SLRU, SYNC_HANDLER_NONE);
if (!found)
{
@@ -1402,7 +1403,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * the page-specific SLRU bank lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1412,9 +1413,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1438,6 +1437,11 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruPageGetSLRULock(NotifyCtl, pageno);
+
+ /* We hold both NotifyQueueLock and SLRU bank lock during this operation */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1509,7 +1513,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(lock);
return nextNotify;
}
@@ -1988,7 +1992,7 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
+ * holding the SLRU bank lock while we are examining the entries
* and possibly transmitting them to our frontend. Copy only the
* part of the page we will actually inspect.
*/
@@ -2010,7 +2014,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2051,7 +2055,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU bank lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2204,7 +2208,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU bank locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+# 26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+# 38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..fe00148956 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,8 +808,8 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ NUM_SERIAL_BUFFERS, 0, "pg_serial", LWTRANCHE_SERIAL_BUFFER,
+ LWTRANCHE_SERIAL_SLRU, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -846,12 +846,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruPageGetSLRULock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -901,7 +903,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -919,10 +921,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -934,13 +936,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
+ * The following function must be called without holding the SLRU bank lock,
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruPageGetSLRULock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -953,7 +955,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -965,7 +967,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -983,7 +985,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -992,7 +994,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1006,12 +1008,12 @@ CheckPointPredicate(void)
{
int tailPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1055,7 +1057,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, tailPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index f5f2b5b8b5..eec7a568dc 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -52,8 +52,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -68,6 +66,13 @@ typedef struct SlruSharedData
int *page_lru_count;
LWLockPadded *buffer_locks;
+ /*
+ * Locks to protect the buffer slots within each SLRU bank. The
+ * buffer_locks protect the I/O on each buffer slot, whereas these locks
+ * protect the in-memory operations on the buffers within one SLRU bank.
+ */
+ LWLockPadded *bank_locks;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -95,7 +100,7 @@ typedef struct SlruSharedData
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -143,11 +148,25 @@ typedef struct SlruCtlData
typedef SlruCtlData *SlruCtl;
+/*
+ * Get the SLRU bank lock for the given SlruCtl and pageno.
+ *
+ * This lock must be acquired in order to access the SLRU buffer slots in
+ * the respective bank. For more details, refer to the comments in
+ * SlruSharedData.
+ */
+static inline LWLock *
+SimpleLruPageGetSLRULock(SlruCtl ctl, int pageno)
+{
+ int bankno = (pageno & ctl->bank_mask);
+
+ /* Each bank of buffer slots is protected by its own dedicated lock */
+ return &(ctl->shared->bank_locks[bankno].lock);
+}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int tranche_id,
+ int slru_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -175,5 +194,7 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern LWLock *SimpleLruPageGetSLRULock(SlruCtl ctl, int pageno);
+extern void SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruReleaseAllBankLock(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d77410bdea..09d2efe8ca 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,14 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
+
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..7b2eb4ae50 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -63,9 +63,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +80,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +99,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +117,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruPageGetSLRULock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +135,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruPageGetSLRULock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +218,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +232,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
+ LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.
A couple of patches have been proposed in the past to address the
problem of increasing the buffer pool size; one of them [1] was
proposed by Thomas Munro, where the size of the buffer pool is made
configurable.
In my last email, I forgot to give the link to the patch from which I
took the base patch for dividing the buffer pool into banks, so here
it is [1]. Looking at it again, it seems the idea of that patch came
from Andrey M. Borodin, and the idea of the SLRU scale factor was
introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing
that in the first email.
[1]: https://commitfest.postgresql.org/43/2627/
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
In my last email, I forgot to give the link to the patch from which I
took the base patch for dividing the buffer pool into banks, so here
it is [1]. Looking at it again, it seems the idea of that patch came
from Andrey M. Borodin, and the idea of the SLRU scale factor was
introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing
that in the first email.
In my last email I had just rebased the base patch. While reading
through it, I realized that some refactoring was needed and that some
unused functions were left over, so I have removed those and added
some comments. I have also done some refactoring of my own patches,
so I am reposting the patch series.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-0002-bank-wise-slru-locks.patch (application/octet-stream)
From 72f6610cbdcdcfdd3a0efe3e83031852c56e0bd9 Mon Sep 17 00:00:00 2001
From: Dilip kumar <dilipkumar@dkmac.local>
Date: Sat, 9 Sep 2023 12:56:10 +0530
Subject: [PATCH v2 2/3] bank wise slru locks
The previous patch divided the SLRU buffer pool into associative
banks. This patch optimizes it further by introducing bank-wise
SLRU locks in place of the common centralized lock, which reduces
contention on the SLRU control lock.

Dilip Kumar, based on design suggestions from Robert Haas
---
src/backend/access/transam/clog.c | 108 +++++++++-----
src/backend/access/transam/commit_ts.c | 43 +++---
src/backend/access/transam/multixact.c | 179 ++++++++++++++++-------
src/backend/access/transam/slru.c | 139 ++++++++++++++----
src/backend/access/transam/subtrans.c | 57 ++++++--
src/backend/commands/async.c | 30 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 32 ++--
src/include/access/slru.h | 32 +++-
src/include/storage/lwlock.h | 8 +
src/test/modules/test_slru/test_slru.c | 32 ++--
12 files changed, 482 insertions(+), 206 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index d4ac85e052..929d89a187 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -277,14 +277,19 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the SLRU bank lock w.r.t. the page we are going to access. */
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on SLRU lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
+ * updates for multiple backends so that the number of times the SLRU lock
* needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
@@ -303,17 +308,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire SLRU lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -326,10 +331,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -348,7 +353,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruGetSLRUBankLock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -399,14 +405,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire SLRU bank lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * the lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention when many
+ * processes are trying to commit at once, since the lock need not be
+ * repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -420,6 +425,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -500,11 +507,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
+ * We are the leader, so clear the list of processes waiting for
* group XID status update, saving a pointer to the head of the list.
* Trying to pop elements one at a time could lead to an ABA problem.
*/
@@ -514,10 +518,38 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /* Acquire the SLRU bank lock w.r.t. the first page in the group. */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruGetSLRUBankLock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * Although we try our best to keep all members of a group on the same
+ * page, there are cases where we may get different pages; for details,
+ * refer to the comment in the while loop above where this process is
+ * added for the group update. So if the page we are about to access
+ * does not fall in the same SLRU bank as the last page we updated,
+ * release the lock on the previous bank and acquire the lock on the
+ * bank of the page we are going to update now.
+ */
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -537,7 +569,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -566,10 +599,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with the page's SLRU bank lock held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -658,7 +692,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(XactCtl, pageno));
return status;
}
@@ -677,7 +711,7 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ "pg_xact", LWTRANCHE_XACT_BUFFER, LWTRANCHE_XACT_SLRU,
SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -692,8 +726,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -702,7 +737,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -737,14 +772,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -755,8 +786,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -788,7 +820,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -820,6 +852,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -830,13 +863,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -974,16 +1008,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 26614d5ceb..645a11d1ab 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -221,8 +221,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -232,13 +233,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the page's SLRU bank lock held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -339,7 +340,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -511,9 +512,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", NUM_COMMIT_TS_BUFFERS, 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -669,9 +669,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -698,12 +696,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -752,9 +751,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruReleaseAllBankLock(CommitTsCtl);
}
/*
@@ -786,6 +785,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -806,12 +806,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -965,16 +967,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -986,7 +990,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index abb022e067..804e3c603c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -192,10 +192,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the SLRU bank locks of MultiXactOffset
+ * and MultiXactMember to guard accesses to the two sets of SLRU buffers. For
+ * concurrency's sake, we avoid holding more than one of these locks at a
+ * time.)
*/
typedef struct MultiXactStateData
{
@@ -870,12 +870,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -891,10 +894,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -916,6 +917,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether this new
+ * page falls into a different SLRU bank; if so, release the old bank's
+ * lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -936,7 +951,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1239,6 +1255,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1342,11 +1360,23 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If the page is in a different SLRU bank, release the lock on the
+ * previous bank (if we are already holding one) and acquire the lock
+ * on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1379,7 +1409,22 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * The SLRU pageno has changed, so check whether this page falls in a
+ * different SLRU bank than the one whose lock we are already holding;
+ * if so, release the lock on the old bank and acquire the lock on the
+ * new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1388,7 +1433,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1397,13 +1443,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1419,6 +1463,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether this new
+ * page falls into a different SLRU bank; if so, release the old bank's
+ * lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1442,7 +1500,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1852,15 +1911,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
- SYNC_HANDLER_MULTIXACT_MEMBER);
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU, SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
/* Initialize our shared state struct */
@@ -1894,8 +1952,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1904,9 +1964,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1915,7 +1976,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1975,10 +2036,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1993,7 +2056,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2015,13 +2078,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2037,7 +2102,6 @@ TrimMultiXact(void)
int pageno;
int entryno;
int flagsoff;
-
LWLockAcquire(MultiXactGenLock, LW_SHARED);
nextMXact = MultiXactState->nextMXact;
offset = MultiXactState->nextOffset;
@@ -2046,13 +2110,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2067,7 +2131,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2075,18 +2141,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2098,7 +2163,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2113,10 +2180,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2404,6 +2470,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2414,13 +2481,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2453,15 +2521,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2759,7 +2829,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3241,31 +3311,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3331,7 +3403,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 57889b72bd..d0931308f8 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -187,9 +187,9 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
- int bankmask_ignore;
+ int bankmask;
- SlruAdjustNSlots(&nslots, &bankmask_ignore);
+ SlruAdjustNSlots(&nslots, &bankmask);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -199,6 +199,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN((bankmask + 1) * sizeof(LWLockPadded)); /* bank_locks[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -206,6 +207,32 @@ SimpleLruShmemSize(int nslots, int nlsns)
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
+/*
+ * Acquire all bank locks of the given SlruCtl, in ascending bank order.
+ */
+void
+SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+
+ for (bankno = 0; bankno <= ctl->bank_mask; bankno++)
+ LWLockAcquire(&shared->bank_locks[bankno].lock, mode);
+}
+
+/*
+ * Release all bank locks of the given SlruCtl.
+ */
+void
+SimpleLruReleaseAllBankLock(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+
+ for (bankno = 0; bankno <= ctl->bank_mask; bankno++)
+ LWLockRelease(&shared->bank_locks[bankno].lock);
+}
+
/*
* Initialize, or attach to, a simple LRU cache in shared memory.
*
@@ -215,12 +242,13 @@ SimpleLruShmemSize(int nslots, int nlsns)
* nlsns: number of LSN groups per page (set to zero if not relevant).
* ctllock: LWLock to use to control access to the shared control structure.
* subdir: PGDATA-relative subdirectory that will contain the files.
- * tranche_id: LWLock tranche ID to use for the SLRU's per-buffer LWLocks.
+ * buffer_tranche_id: tranche ID to use for the SLRU's per-buffer LWLocks.
+ * bank_tranche_id: tranche ID to use for the SLRU's per-bank LWLocks.
* sync_handler: which set of functions to use to handle sync requests
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int buffer_tranche_id, int bank_tranche_id,
SyncRequestHandler sync_handler)
{
SlruShared shared;
@@ -239,13 +267,13 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int nbanks = bankmask + 1;
+ int bankno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
@@ -271,6 +299,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->bank_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
if (nlsns > 0)
{
@@ -282,7 +312,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
for (slotno = 0; slotno < nslots; slotno++)
{
LWLockInitialize(&shared->buffer_locks[slotno].lock,
- tranche_id);
+ buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -290,6 +320,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->page_lru_count[slotno] = 0;
ptr += BLCKSZ;
}
+ /* Initialize bank locks for each buffer bank. */
+ for (bankno = 0; bankno < nbanks; bankno++)
+ LWLockInitialize(&shared->bank_locks[bankno].lock,
+ bank_tranche_id);
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -344,7 +378,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -383,12 +417,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int bankno = slotno / SLRU_BANK_SIZE;
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -443,6 +478,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int bankno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -485,9 +521,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ bankno = slotno / SLRU_BANK_SIZE;
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -496,7 +533,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -538,11 +575,12 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -562,8 +600,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -585,6 +623,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int bankno = slotno / SLRU_BANK_SIZE;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -613,7 +652,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -628,7 +667,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1133,7 +1172,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1207,6 +1246,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int lastbankno = 0;
bool ok;
/* update the stats counter of flushes */
@@ -1217,10 +1257,19 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[0].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / SLRU_BANK_SIZE;
+
+ if (curbankno != lastbankno)
+ {
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ lastbankno = curbankno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1234,7 +1283,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
/*
* Now close any files that were open
@@ -1274,6 +1323,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevbankno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1284,25 +1334,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevbankno = 0;
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / SLRU_BANK_SIZE;
+
+	/*
+	 * If curbankno is not the same as prevbankno, release the lock on
+	 * prevbankno and acquire the lock on curbankno.
+	 */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1332,10 +1395,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1376,15 +1441,31 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevbankno = 0;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curbankno;
+
+ curbankno = slotno / SLRU_BANK_SIZE;
+
+	/*
+	 * If curbankno is not the same as prevbankno, release the lock on
+	 * prevbankno and acquire the lock on curbankno.
+	 */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1418,7 +1499,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
}
/*
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 125273e235..48f22a5fcd 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -77,12 +77,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -100,7 +102,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -130,7 +132,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SubTransCtl, pageno));
return parent;
}
@@ -193,8 +195,8 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -212,8 +214,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -222,7 +225,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -252,6 +255,8 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
FullTransactionId nextXid;
int startPage;
int endPage;
+ LWLock *prevlock;
+ LWLock *lock;
/*
* Since we don't expect pg_subtrans to be valid across crashes, we
@@ -259,23 +264,47 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
-
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+ prevlock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
while (startPage != endPage)
{
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+	/*
+	 * If this page falls into a different bank than the previous one,
+	 * release the lock on the old bank and acquire it on the new bank.
+	 */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
(void) ZeroSUBTRANSPage(startPage);
startPage++;
/* must account for wraparound */
if (startPage > TransactionIdToPage(MaxTransactionId))
startPage = 0;
}
- (void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+	/*
+	 * If this page falls into a different bank than the previous one,
+	 * release the lock on the old bank and acquire it on the new bank.
+	 */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ (void) ZeroSUBTRANSPage(startPage);
+ LWLockRelease(lock);
}
/*
@@ -309,6 +338,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -320,12 +350,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index d148d10850..2fc230ca51 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -267,9 +267,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * The SLRU buffer pool is divided into banks, and a bank-wise SLRU lock is
+ * used as the control lock for the pg_notify SLRU buffers.
 * In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly the SLRU bank lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -570,8 +571,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
- SYNC_HANDLER_NONE);
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ LWTRANCHE_NOTIFY_SLRU, SYNC_HANDLER_NONE);
if (!found)
{
@@ -1402,7 +1403,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * the page-specific SLRU bank lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1412,9 +1413,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1438,6 +1437,11 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+
+ /* We hold both NotifyQueueLock and SLRU bank lock during this operation */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1509,7 +1513,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(lock);
return nextNotify;
}
@@ -1988,7 +1992,7 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
+ * holding the SLRU lock while we are examining the entries
* and possibly transmitting them to our frontend. Copy only the
* part of the page we will actually inspect.
*/
@@ -2010,7 +2014,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2051,7 +2055,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU bank lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2204,7 +2208,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU bank locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+#26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+#38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..e771aaa82b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,8 +808,8 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ NUM_SERIAL_BUFFERS, 0, "pg_serial", LWTRANCHE_SERIAL_BUFFER,
+ LWTRANCHE_SERIAL_SLRU, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -846,12 +846,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruGetSLRUBankLock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -901,7 +903,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -919,10 +921,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -934,13 +936,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
+ * The following function must be called without holding the SLRU bank lock,
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -953,7 +955,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -965,7 +967,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -983,7 +985,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -992,7 +994,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1006,12 +1008,12 @@ CheckPointPredicate(void)
{
int tailPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1055,7 +1057,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, tailPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index f5f2b5b8b5..8844853a57 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -52,8 +52,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -68,6 +66,13 @@ typedef struct SlruSharedData
int *page_lru_count;
LWLockPadded *buffer_locks;
+	/*
+	 * Locks to protect the in-memory buffer slot access, one per SLRU bank.
+	 * The buffer_locks protect the I/O on each buffer slot, whereas these
+	 * locks protect the in-memory operations on the buffers within one bank.
+	 */
+ LWLockPadded *bank_locks;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -95,7 +100,7 @@ typedef struct SlruSharedData
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -143,11 +148,24 @@ typedef struct SlruCtlData
typedef SlruCtlData *SlruCtl;
+/*
+ * Get the SLRU bank lock for the given SlruCtl and pageno.
+ *
+ * This lock must be acquired in order to access the SLRU buffer slots in
+ * the respective bank. For details, see the comments in SlruSharedData.
+ */
+static inline LWLock *
+SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno)
+{
+ int bankno = (pageno & ctl->bank_mask);
+
+ return &(ctl->shared->bank_locks[bankno].lock);
+}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int buffer_tranche_id,
+ int bank_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -175,5 +193,7 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern void SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruReleaseAllBankLock(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d77410bdea..09d2efe8ca 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,14 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
+
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..9a02f33933 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -40,10 +40,6 @@ PG_FUNCTION_INFO_V1(test_slru_delete_all);
/* Number of SLRU page slots */
#define NUM_TEST_BUFFERS 16
-/* SLRU control lock */
-LWLock TestSLRULock;
-#define TestSLRULock (&TestSLRULock)
-
static SlruCtlData TestSlruCtlData;
#define TestSlruCtl (&TestSlruCtlData)
@@ -63,9 +59,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +76,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +95,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +113,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +131,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +214,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +228,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
+ LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
Attachment: v2-0003-Introduce-bank-wise-LRU-counter.patch (application/octet-stream)
From 9c8528913575edd9dd8a095e9cd7dd648fed0c5f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 12 Oct 2023 16:04:14 +0530
Subject: [PATCH v2 3/3] Introduce bank-wise LRU counter
Since we have already divided the buffer pool into banks, and the
victim buffer search is also done at the bank level, there is no
need for a centralized LRU counter. This also improves performance
by avoiding the frequent CPU cache invalidation caused by updating
a single shared variable.
Dilip Kumar, based on a design idea from Robert Haas
---
src/backend/access/transam/slru.c | 23 +++++++++++++++--------
src/include/access/slru.h | 28 +++++++++++++++++-----------
2 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index d0931308f8..318d9ea3fa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -110,13 +110,13 @@ typedef struct SlruWriteAllData *SlruWriteAll;
*
* The reason for the if-test is that there are often many consecutive
* accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
+ * useless increments of bank_cur_lru_count, we reduce the probability that old
* pages' counts will "wrap around" and make them appear recently used.
*
* We allow this code to be executed concurrently by multiple processes within
* SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
* this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
+ * However, it is possible for either bank_cur_lru_count or individual
* page_lru_count entries to be "reset" to lower values than they should have,
* in case a process is delayed while it executes this macro. With care in
* SlruSelectLRUPage(), this does little harm, and in any case the absolute
@@ -125,9 +125,10 @@ typedef struct SlruWriteAllData *SlruWriteAll;
*/
#define SlruRecentlyUsed(shared, slotno) \
do { \
- int new_lru_count = (shared)->cur_lru_count; \
+ int slrubankno = (slotno) / SLRU_BANK_SIZE; \
+ int new_lru_count = (shared)->bank_cur_lru_count[slrubankno]; \
if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
+ (shared)->bank_cur_lru_count[slrubankno] = ++new_lru_count; \
(shared)->page_lru_count[slotno] = new_lru_count; \
} \
} while (0)
@@ -200,6 +201,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
sz += MAXALIGN((bankmask + 1) * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN((bankmask + 1) * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -277,8 +279,6 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -301,6 +301,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
shared->bank_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
+ shared->bank_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(int));
if (nlsns > 0)
{
@@ -322,8 +324,11 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
}
/* Initialize bank locks for each buffer bank. */
for (bankno = 0; bankno < nbanks; bankno++)
+ {
LWLockInitialize(&shared->bank_locks[bankno].lock,
bank_tranche_id);
+ shared->bank_cur_lru_count[bankno] = 0;
+ }
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -1113,9 +1118,11 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
+
for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
@@ -1150,7 +1157,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
+ cur_count = (shared->bank_cur_lru_count[bankno])++;
for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 8844853a57..9be6d26d78 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -73,6 +73,23 @@ typedef struct SlruSharedData
*/
LWLockPadded *bank_locks;
+ /*----------
+ * Instead of global counter we maintain a bank-wise lru counter because
+ * a) we are doing the victim buffer selection as bank level so there is
+ * no point of having a global counter b) manipulating a global counter
+ * will have frequent cpu cache invalidation and that will affect the
+ * performance.
+ *
+ * We mark a page "most recently used" by setting
+ * page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
+ * The oldest page is therefore the one with the highest value of
+ * bank_cur_lru_count[bankno] - page_lru_count[slotno]
+ * The counts will eventually wrap around, but this calculation still
+ * works as long as no page's age exceeds INT_MAX counts.
+ *----------
+ */
+ int *bank_cur_lru_count;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -84,17 +101,6 @@ typedef struct SlruSharedData
XLogRecPtr *group_lsn;
int lsn_groups_per_page;
- /*----------
- * We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
- * The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
- * The counts will eventually wrap around, but this calculation still
- * works as long as no page's age exceeds INT_MAX counts.
- *----------
- */
- int cur_lru_count;
-
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
--
2.39.2 (Apple Git-143)
Attachment: v2-0001-Divide-SLRU-buffers-into-banks.patch (application/octet-stream)
From 5fa38ace34f0c460c9af8889ea922c2d5c4d0b38 Mon Sep 17 00:00:00 2001
From: Dilip kumar <dilipkumar@dkmac.local>
Date: Fri, 8 Sep 2023 15:08:32 +0530
Subject: [PATCH v2 1/3] Divide SLRU buffers into banks
We want to eliminate linear search within SLRU buffers.
To do so we divide SLRU buffers into banks. Each bank holds
approximately 8 buffers. Each SLRU pageno may reside only in one bank.
Adjacent pagenos reside in different banks.
Also invent slru_buffers_size_scale to control SLRU buffers.
Andrey M. Borodin, Yura Sokolov, and Ivan Lazarev, with minor refactoring by Dilip Kumar
---
doc/src/sgml/config.sgml | 31 +++++++++++
src/backend/access/transam/clog.c | 29 ++--------
src/backend/access/transam/commit_ts.c | 20 ++-----
src/backend/access/transam/slru.c | 54 +++++++++++++++++--
src/backend/access/transam/subtrans.c | 1 +
src/backend/utils/init/globals.c | 2 +
src/backend/utils/misc/guc_tables.c | 10 ++++
src/backend/utils/misc/postgresql.conf.sample | 3 ++
src/include/access/clog.h | 1 -
src/include/access/commit_ts.h | 1 -
src/include/access/multixact.h | 4 +-
src/include/access/slru.h | 5 ++
src/include/access/subtrans.h | 2 +-
src/include/commands/async.h | 2 +-
src/include/miscadmin.h | 2 +
src/include/storage/predicate.h | 2 +-
16 files changed, 119 insertions(+), 50 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 924309af26..416d979b54 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,37 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-slru-buffers-size-scale" xreflabel="slru_buffers_size_scale">
+ <term><varname>slru_buffers_size_scale</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>slru_buffers_size_scale</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a power-of-2 scale for all SLRU shared memory buffer sizes. The buffer sizes depend on
+ both the <literal>slru_buffers_size_scale</literal> and <literal>shared_buffers</literal> parameters.
+ </para>
+ <para>
+ This affects the buffers in the list below (see also <xref linkend="pgdata-contents-table"/>):
+ <itemizedlist>
+ <listitem><para><literal>NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ <listitem><para><literal>NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)</literal></para></listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ The value must be in the range <literal>0..7</literal>.
+ The default value is <literal>2</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..d4ac85e052 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -74,6 +74,9 @@
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
+/* Number of SLRU buffers to use for clog */
+#define NUM_CLOG_BUFFERS (128 << slru_buffers_size_scale)
+
/*
* The number of subtransactions below which we consider to apply clog group
* update optimization. Testing reveals that the number higher than this can
@@ -660,42 +663,20 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
return status;
}
-/*
- * Number of shared CLOG buffers.
- *
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
- */
-Size
-CLOGShmemBuffers(void)
-{
- return Min(128, Max(4, NBuffers / 512));
-}
-
/*
* Initialization of shared memory for CLOG
*/
Size
CLOGShmemSize(void)
{
- return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE);
+ return SimpleLruShmemSize(NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE);
}
void
CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
- SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
+ SimpleLruInit(XactCtl, "Xact", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..26614d5ceb 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -70,6 +70,9 @@ typedef struct CommitTimestampEntry
#define TransactionIdToCTsEntry(xid) \
((xid) % (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+/* Number of SLRU buffers to use for commit_ts */
+#define NUM_COMMIT_TS_BUFFERS (128 << slru_buffers_size_scale)
+
/*
* Link to shared-memory data structures for CommitTs control
*/
@@ -487,26 +490,13 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(htup));
}
-/*
- * Number of shared CommitTS buffers.
- *
- * We use a very similar logic as for the number of CLOG buffers (except we
- * scale up twice as fast with shared buffers, and the maximum is twice as
- * high); see comments in CLOGShmemBuffers.
- */
-Size
-CommitTsShmemBuffers(void)
-{
- return Min(256, Max(4, NBuffers / 256));
-}
-
/*
* Shared memory sizing for CommitTs
*/
Size
CommitTsShmemSize(void)
{
- return SimpleLruShmemSize(CommitTsShmemBuffers(), 0) +
+ return SimpleLruShmemSize(NUM_COMMIT_TS_BUFFERS, 0) +
sizeof(CommitTimestampShared);
}
@@ -520,7 +510,7 @@ CommitTsShmemInit(void)
bool found;
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
- SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
+ SimpleLruInit(CommitTsCtl, "CommitTs", NUM_COMMIT_TS_BUFFERS, 0,
CommitTsSLRULock, "pg_commit_ts",
LWTRANCHE_COMMITTS_BUFFER,
SYNC_HANDLER_COMMIT_TS);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 71ac70fb40..57889b72bd 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "port/pg_bitutils.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -71,6 +72,17 @@
*/
#define MAX_WRITEALL_BUFFERS 16
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
+
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 8
+
typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
@@ -134,7 +146,7 @@ typedef enum
static SlruErrorCause slru_errcause;
static int slru_errno;
-
+static void SlruAdjustNSlots(int *nslots, int *bankmask);
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +160,25 @@ static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+/*
+ * Pick a number of slots and a bank size suitable for hashed associative SLRU buffers.
+ * The number of SLRU slots is always rounded to a power of 2.
+ * Based on performance benchmarks, we split the SLRU into banks of 8 buffers each.
+ * A pageno is hashed to a bank using its low-order bits (pageno & bank_mask).
+ */
+static void
+SlruAdjustNSlots(int *nslots, int *bankmask)
+{
+ Assert(*nslots > 0);
+ Assert(*nslots <= SLRU_MAX_ALLOWED_BUFFERS);
+
+ *nslots = (int) pg_nextpower2_32(Max(SLRU_BANK_SIZE, Min(*nslots, NBuffers / 256)));
+
+ *bankmask = *nslots / SLRU_BANK_SIZE - 1;
+
+ elog(DEBUG5, "nslots %d banksize %d nbanks %d bankmask %x", *nslots, SLRU_BANK_SIZE, *nslots / SLRU_BANK_SIZE, *bankmask);
+}
+
/*
* Initialization of shared memory
*/
@@ -156,6 +187,9 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
+ int bankmask_ignore;
+
+ SlruAdjustNSlots(&nslots, &bankmask_ignore);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -191,6 +225,9 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
{
SlruShared shared;
bool found;
+ int bankmask;
+
+ SlruAdjustNSlots(&nslots, &bankmask);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -258,7 +295,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
else
+ {
Assert(found);
+ Assert(shared->num_slots == nslots);
+ }
/*
* Initialize the unshared control struct, including directory path. We
@@ -266,6 +306,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
*/
ctl->shared = shared;
ctl->sync_handler = sync_handler;
+ ctl->bank_mask = bankmask;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -497,12 +538,14 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
@@ -1031,7 +1074,10 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
+
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
@@ -1066,7 +1112,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
int this_page_number;
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..125273e235 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 011ec18015..61b12d1056 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -154,3 +154,5 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int slru_buffers_size_scale = 2; /* power 2 scale for SLRU buffers */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..4a182225b7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2277,6 +2277,16 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"slru_buffers_size_scale", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("SLRU buffers size scale of power 2"),
+ NULL
+ },
+ &slru_buffers_size_scale,
+ 2, 0, 7,
+ NULL, NULL, NULL
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..136ea5f48c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -157,6 +157,9 @@
# mmap
# (change requires restart)
#min_dynamic_shared_memory = 0MB # (change requires restart)
+#slru_buffers_size_scale = 2 # SLRU buffers size scale of power 2, range 0..7
+ # (change requires restart)
+
#vacuum_buffer_usage_limit = 256kB # size of vacuum and analyze buffer access strategy ring;
# 0 to disable vacuum buffer access strategy;
# range 128kB to 16GB
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index d99444f073..cee7e19b3f 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -40,7 +40,6 @@ extern void TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
extern XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn);
-extern Size CLOGShmemBuffers(void);
extern Size CLOGShmemSize(void);
extern void CLOGShmemInit(void);
extern void BootStrapCLOG(void);
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..155e82eb4f 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -27,7 +27,6 @@ extern bool TransactionIdGetCommitTsData(TransactionId xid,
extern TransactionId GetLatestCommitTsData(TimestampTz *ts,
RepOriginId *nodeid);
-extern Size CommitTsShmemBuffers(void);
extern Size CommitTsShmemSize(void);
extern void CommitTsShmemInit(void);
extern void BootStrapCommitTs(void);
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 246f757f6a..6a2c914d48 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -30,8 +30,8 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
+#define NUM_MULTIXACTOFFSET_BUFFERS (16 << slru_buffers_size_scale)
+#define NUM_MULTIXACTMEMBER_BUFFERS (32 << slru_buffers_size_scale)
/*
* Possible multixact lock modes ("status"). The first four modes are for
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index a8a424d92d..f5f2b5b8b5 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -134,6 +134,11 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /*
+ * mask for slotno hash bank
+ */
+ Size bank_mask;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..0dad287550 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -12,7 +12,7 @@
#define SUBTRANS_H
/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
+#define NUM_SUBTRANS_BUFFERS (32 << slru_buffers_size_scale)
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b1d59472b1 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -18,7 +18,7 @@
/*
* The number of SLRU page buffers we use for the notification queue.
*/
-#define NUM_NOTIFY_BUFFERS 8
+#define NUM_NOTIFY_BUFFERS (16 << slru_buffers_size_scale)
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..f2cec02a2f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,7 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int slru_buffers_size_scale;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
@@ -262,6 +263,7 @@ extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT double hash_mem_multiplier;
extern PGDLLIMPORT int maintenance_work_mem;
extern PGDLLIMPORT int max_parallel_maintenance_workers;
+extern PGDLLIMPORT int slru_buffers_size_scale;
/*
* Upper and lower hard limits for the buffer access strategy ring size
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..794ecd8169 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -28,7 +28,7 @@ extern PGDLLIMPORT int max_predicate_locks_per_page;
/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
+#define NUM_SERIAL_BUFFERS (16 << slru_buffers_size_scale)
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
--
2.39.2 (Apple Git-143)
On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There are a couple of patches proposed in the past to address the
problem of increasing the buffer pool size, one of the patch [1] was
proposed by Thomas Munro where we make the size of the buffer pool
configurable. And, in order to deal with the linear search in the
large buffer pool, we divide the SLRU buffer pool into associative
banks so that searching in the buffer pool doesn’t get affected by the
large size of the buffer pool. This does well for the workloads which
are mainly impacted by the frequent buffer replacement but this still
doesn’t stand well with the workloads where the centralized control
lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and further
added 2 more improvements to this 1) In v1-0002, Instead of a
centralized control lock for the SLRU I have introduced a bank-wise
control lock 2)In v1-0003, I have removed the global LRU counter and
introduced a bank-wise counter. The second change (v1-0003) is in
order to avoid the CPU/OS cache invalidation due to frequent updates
of the single variable, later in my performance test I will show how
much gain we have gotten because of these 2 changes.

Note: This is going to be a long email but I have summarised the main
idea above this point and now I am going to discuss more internal
information in order to show that the design idea is valid and also
going to show 2 performance tests where one is specific to the
contention on the centralized lock and other is mainly contention due
to frequent buffer replacement in SLRU buffer pool. We are getting ~2x
TPS compared to the head by these patches and in later sections, I am
going discuss this in more detail i.e. exact performance numbers and
analysis of why we are seeing the gain.
...
Performance Test:
Exp1: Show the problem of CPU/OS cache invalidation due to frequent
updates of the centralized lock and the common LRU counter. Here we
run, in parallel with the pgbench script, a transaction that
frequently creates subtransaction overflow, which forces the
visibility-check mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 MB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (creates subtransaction overflow):
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied to convert the SLRU buffer pool into
the bank (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731

So we can see that at 128 clients, we get ~2x TPS (with SlruBank +
BankwiseLock and bankwise LRU counter) as compared to HEAD.
This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage? BTW, I remember
that there was also a discussion of moving SLRU into a regular buffer
pool [1]. You have not provided any explanation as to whether that
approach will have any merits after we do this or whether that
approach is not worth pursuing at all.
[1]: https://commitfest.postgresql.org/43/3514/
--
With Regards,
Amit Kapila.
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?
I did not understand this question; can you explain it a bit? In
short, if it is about performance, then we will see the improvement
for all the SLRUs, since the control lock is no longer centralized
but is instead a bank-wise lock.
BTW, I remember
that there was also a discussion of moving SLRU into a regular buffer
pool [1]. You have not provided any explanation as to whether that
approach will have any merits after we do this or whether that
approach is not worth pursuing at all.
Yeah, I haven't read that thread in detail with respect to the
performance numbers. But these two approaches cannot coexist, because
this patch improves SLRU buffer pool access (and makes the size
configurable) as well as the lock contention. If we move the SLRU into
the main buffer pool, we might not have a similar problem; instead
there might be other problems, such as SLRU buffers getting evicted
because of competing relation buffers. OTOH, the advantage of that
approach is that the SLRU could simply take advantage of a bigger
buffer pool. But in my opinion, SLRU access is mostly limited to a
small set of pages, and the SLRU buffer access pattern is quite
different from the relation-page access pattern, so keeping them in
the same buffer pool and competing against relation pages for victim
buffer selection might cause other problems. Anyway, I would rather
discuss those points in that thread.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2023-Oct-11, Dilip Kumar wrote:
In my last email, I forgot to give the link to the patch from which I
have taken the base for dividing the buffer pool into banks, so I am
giving it here [1]. Looking at this again, it seems that the idea of
that patch was from Andrey M. Borodin, and the idea of the SLRU scale
factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.
You mean [1].
[1]: /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:
NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.
I don't see why CLOG would have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).
... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.
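To make the objection concrete, here is the quoted sizing rule as a
tiny function (my own illustration; only the Min(base << scale,
shared_buffers/256) shape is taken from the quoted documentation).
Bumping the single scale knob to relieve one contended SLRU doubles
every other SLRU's allocation along with it:

```c
#include <assert.h>

static int
min_int(int a, int b)
{
	return a < b ? a : b;
}

/* The quoted sizing rule: Min(base << scale, shared_buffers / 256). */
static int
scaled_buffers(int base, int scale, int shared_buffers)
{
	return min_int(base << scale, shared_buffers / 256);
}
```

For example, with shared_buffers = 65536 (512MB at 8kB blocks), going
from scale 0 to 3 to help multixact also grows CLOG from 128 buffers
to the 256 cap, and every other SLRU grows with it whether it needs
the memory or not.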
Like everybody else, I like having fewer GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.
I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
I did skim patches 0002 and 0003 without going into too much detail;
they look reasonable ideas. I have not tried to reproduce the claimed
performance benefits. I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3]: /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com
On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...
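For what it's worth, the victim-selection half of that idea could be
as simple as this sketch (all names invented here; the hard part, as
noted above, is doing this without creating a new global hot spot):

```c
#include <assert.h>

/*
 * One pool slot shared by all SLRUs: it remembers which SLRU and page
 * it holds and when it was last used.
 */
typedef struct PoolSlot
{
	int		slru_id;		/* owning SLRU, -1 if free */
	int		pageno;
	int		last_used;		/* LRU tick */
} PoolSlot;

/*
 * Pick the globally least-recently-used slot, regardless of which SLRU
 * owns it, so the per-SLRU share of the pool emerges from the workload
 * rather than from fixed ratios.
 */
static int
pick_victim(const PoolSlot *pool, int nslots)
{
	int		best = 0;

	for (int i = 1; i < nslots; i++)
		if (pool[i].last_used < pool[best].last_used)
			best = i;
	return best;
}
```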
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
-- Paul Graham, http://www.paulgraham.com/opensource.html
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:
Like everybody else, I like having fewer GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.
+1
On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...
I think it's worth a try. It does seem simpler, and it might allow us to
sidestep some concerns about scaling when the SLRU pages are in
shared_buffers [0].
[0]: /messages/by-id/ZPsaEGRvllitxB3v@tamriel.snowman.net
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Oct-11, Dilip Kumar wrote:
In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from Andrey M. Borodin and the idea of the SLRU scale factor
were introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.
You mean [1].
[1] /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:
NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.
I don't see why CLOG would have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).
... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.
Like everybody else, I like having fewer GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.
Overall I agree with your comments. Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. So I think it is better that we
take the older patch version, where we have a separate GUC per SLRU,
as our base patch.
I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
I will rebase my patches on top of this.
I did skim patches 0002 and 0003 without going into too much detail;
they look reasonable ideas. I have not tried to reproduce the claimed
performance benefits. I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3] /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com
Thanks for taking a look.
On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
Currently we do not steal buffers from shared_buffers, though the
computation depends on NBuffers. I mean, for each SLRU we compute
separate memory that is in addition to shared_buffers, no?
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...
Yeah, this could be an idea. But are you saying that all the SLRUs
will share a single buffer pool, and an LRU algorithm will decide
which pages stay in the pool and which get evicted? Wouldn't that
create another issue of different SLRUs starting to contend on the
same lock, if we have a common buffer pool for all the SLRUs? Or am I
missing something? Or are you saying that, although there is a common
buffer pool, each SLRU will have its own boundaries within it,
protected by a separate lock, with those boundaries changing
dynamically based on the workload? I haven't put much thought into
how practical the idea is; I'm just trying to understand what you have
in mind.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?
I did not understand this question; can you explain it a bit?
I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?
--
With Regards,
Amit Kapila.
On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?
I did not understand this question; can you explain it a bit?
I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?
Yes, there should be, because 1) there is no common lock anymore, so
contention on a centralized control lock is reduced when we access
transaction status from pages falling in different SLRU banks, and 2)
the buffer size is configurable, so if the workload accesses
transaction status across a wide range of pages it would help with
frequent buffer eviction, though that might not be the most common
case.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Overall I agree with your comments. Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. So I think it is better that we
take the older patch version, where we have a separate GUC per SLRU,
as our base patch.
I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
I will rebase my patches on top of this.
I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and
changed the logic of SlruAdjustNSlots() in 0002: it now starts with
the next power-of-2 value of the configured slots and keeps doubling
the number of banks until we reach the maximum of SLRU_MAX_BANKS (128)
banks or the bank size would drop below SLRU_MIN_BANK_SIZE (8). By
doing so, we ensure that we don't have too many banks, but also that
we don't have very large banks. There was also a patch 0003 in that
thread, but I haven't taken it, as it is another optimization (merging
some structure members); I will analyze its performance
characteristics and try to add it on top of the complete patch series.
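In pseudocode terms, the adjustment behaves roughly like this (a
sketch of the description above, not the patch's SlruAdjustNSlots()
verbatim):

```c
#include <assert.h>

#define SLRU_MIN_BANK_SIZE 8
#define SLRU_MAX_BANKS 128

/* Round n up to the next power of two (a simple loop is fine here). */
static int
next_pow2(int n)
{
	int		p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Round the configured slot count up to a power of two, then keep
 * doubling the bank count while each bank would still hold at least
 * SLRU_MIN_BANK_SIZE slots and we stay within SLRU_MAX_BANKS banks.
 */
static void
adjust_nslots(int *nslots, int *banksize, int *nbanks)
{
	*nslots = next_pow2(*nslots);
	*nbanks = 1;
	while (*nbanks < SLRU_MAX_BANKS &&
		   *nslots / (*nbanks * 2) >= SLRU_MIN_BANK_SIZE)
		*nbanks *= 2;
	*banksize = *nslots / *nbanks;
}
```

E.g. 100 configured slots round up to 128 and split into 16 banks of 8
slots each, so no bank is ever searched linearly beyond a handful of
entries.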
Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above two are taken from [1] with some modifications and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email
of this thread
0004 - Implement bank-wise LRU counter as described in the first email
of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e.
merging buffer locks and bank locks in the same array so that the
bank-wise LRU counter does not fetch the next cache line in a hot
function SlruRecentlyUsed()
Note: I think 0003,0004 and 0005 can be merged together but kept
separate so that we can review them independently and see how useful
each of them is.
[1]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v3-0005-Merge-bank-locks-array-with-buffer-locks-array.patch (application/octet-stream)
From c80516008f76a8a4b68ff5cab9ada952373ee6ff Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sat, 28 Oct 2023 16:24:04 +0530
Subject: [PATCH v3 5/5] Merge bank locks array with buffer locks array
This will help us get the bank_cur_lru_count in the same cacheline,
which is frequently accessed in SlruRecentlyUsed.
---
src/backend/access/transam/slru.c | 123 ++++++++++++++++--------------
src/include/access/slru.h | 15 ++--
2 files changed, 72 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 6c8c21f215..3728c02607 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -156,8 +156,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(bool)); /* page_dirty[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(nbanks * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN((nslots + nslots) * sizeof(LWLockPadded)); /* locks[] */
sz += MAXALIGN(nbanks * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
@@ -229,10 +228,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(int));
/* Initialize LWLocks */
- shared->buffer_locks = (LWLockPadded *) (ptr + offset);
- offset += MAXALIGN(nslots * sizeof(LWLockPadded));
- shared->bank_locks = (LWLockPadded *) (ptr + offset);
- offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
+ shared->locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN((nslots + nbanks) * sizeof(LWLockPadded));
shared->bank_cur_lru_count = (int *) (ptr + offset);
offset += MAXALIGN(nbanks * sizeof(int));
@@ -245,8 +242,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
ptr += BUFFERALIGN(offset);
for (slotno = 0; slotno < nslots; slotno++)
{
- LWLockInitialize(&shared->buffer_locks[slotno].lock,
- buffer_tranche_id);
+ LWLockInitialize(&shared->locks[slotno].lock, buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -257,7 +253,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize bank locks for each buffer bank. */
for (bankno = 0; bankno < nbanks; bankno++)
{
- LWLockInitialize(&shared->bank_locks[bankno].lock,
+ LWLockInitialize(&shared->locks[nslots + bankno].lock,
bank_tranche_id);
shared->bank_cur_lru_count[bankno] = 0;
}
@@ -356,12 +352,13 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
int bankno = slotno / ctl->bank_size;
+ int banklockoffset = shared->num_slots + bankno;
/* See notes at top of file */
- LWLockRelease(&shared->bank_locks[bankno].lock);
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
- LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->locks[banklockoffset].lock);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_SHARED);
+ LWLockRelease(&shared->locks[slotno].lock);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -374,7 +371,7 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS)
{
- if (LWLockConditionalAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED))
+ if (LWLockConditionalAcquire(&shared->locks[slotno].lock, LW_SHARED))
{
/* indeed, the I/O must have failed */
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
@@ -384,7 +381,7 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
}
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
}
}
}
@@ -417,6 +414,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
{
int slotno;
int bankno;
+ int banklockoffset;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -458,11 +456,12 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
shared->page_dirty[slotno] = false;
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_EXCLUSIVE);
bankno = slotno / ctl->bank_size;
+ banklockoffset = shared->num_slots + bankno;
/* Release control lock while doing I/O */
- LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockRelease(&shared->locks[banklockoffset].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -471,7 +470,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -479,7 +478,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
shared->page_status[slotno] = ok ? SLRU_PAGE_VALID : SLRU_PAGE_EMPTY;
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
/* Now it's okay to ereport if we failed */
if (!ok)
@@ -516,9 +515,10 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
int bankno = pageno & ctl->bank_mask;
int bankstart = bankno * ctl->bank_size;
int bankend = bankstart + ctl->bank_size;
+ int banklockoffset = shared->num_slots + bankno;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(&shared->bank_locks[bankno].lock, LW_SHARED);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -538,8 +538,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(&shared->bank_locks[bankno].lock);
- LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->locks[banklockoffset].lock);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -562,6 +562,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
int pageno = shared->page_number[slotno];
bool ok;
int bankno = slotno / ctl->bank_size;
+ int banklockoffset = shared->num_slots + bankno;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -587,10 +588,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
shared->page_dirty[slotno] = false;
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockRelease(&shared->locks[banklockoffset].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -605,7 +606,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -616,7 +617,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
shared->page_status[slotno] = SLRU_PAGE_VALID;
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
/* Now it's okay to ereport if we failed */
if (!ok)
@@ -1185,7 +1186,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
- int lastbankno = 0;
+ int prevlockoffset = shared->num_slots;
bool ok;
/* update the stats counter of flushes */
@@ -1196,17 +1197,17 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(&shared->bank_locks[0].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int curbankno = slotno / ctl->bank_size;
+ int curlockoffset = shared->num_slots + slotno / ctl->bank_size;
- if (curbankno != lastbankno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->bank_locks[lastbankno].lock);
- LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
- lastbankno = curbankno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
SlruInternalWritePage(ctl, slotno, &fdata);
@@ -1222,7 +1223,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(&shared->bank_locks[lastbankno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
/*
* Now close any files that were open
@@ -1262,7 +1263,8 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
- int prevbankno;
+ int nslots = shared->num_slots;
+ int prevlockoffset;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1288,21 +1290,21 @@ restart:
return;
}
- prevbankno = 0;
- LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ prevlockoffset = nslots;
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
+ for (slotno = 0; slotno < nslots; slotno++)
{
- int curbankno = slotno / ctl->bank_size;
+ int curlockoffset = nslots + (slotno / ctl->bank_size);
/*
* If the curbankno is not same as prevbankno then release the lock on
* the prevbankno and acquire the lock on the curbankno.
*/
- if (curbankno != prevbankno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->bank_locks[prevbankno].lock);
- LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
- prevbankno = curbankno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
@@ -1335,11 +1337,11 @@ restart:
else
SimpleLruWaitIO(ctl, slotno);
- LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
goto restart;
}
- LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1380,28 +1382,29 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
- int prevbankno = 0;
+ int nslots = shared->num_slots;
+ int prevlockoffset = nslots;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
restart:
did_write = false;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = 0; slotno < nslots; slotno++)
{
int pagesegno;
- int curbankno;
+ int curlockoffset;
- curbankno = slotno / ctl->bank_size;
+ curlockoffset = nslots + (slotno / ctl->bank_size);
/*
* If the curbankno is not same as prevbankno then release the lock on
* the prevbankno and acquire the lock on the curbankno.
*/
- if (curbankno != prevbankno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->bank_locks[prevbankno].lock);
- LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
- prevbankno = curbankno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
@@ -1438,7 +1441,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
}
/*
@@ -1756,10 +1759,11 @@ SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
{
SlruShared shared = ctl->shared;
int bankno;
- int nbanks = shared->num_slots / ctl->bank_size;
+ int nslots = shared->num_slots;
+ int nbanks = nslots / ctl->bank_size;
for (bankno = 0; bankno < nbanks; bankno++)
- LWLockAcquire(&shared->bank_locks[bankno].lock, mode);
+ LWLockAcquire(&shared->locks[nslots + bankno].lock, mode);
}
/*
@@ -1770,8 +1774,9 @@ SimpleLruReleaseAllBankLock(SlruCtl ctl)
{
SlruShared shared = ctl->shared;
int bankno;
- int nbanks = shared->num_slots / ctl->bank_size;
+ int nslots = shared->num_slots;
+ int nbanks = nslots / ctl->bank_size;
for (bankno = 0; bankno < nbanks; bankno++)
- LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockRelease(&shared->locks[nslots + bankno].lock);
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index a18b07f5d0..6759c900f3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -69,14 +69,14 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
- LWLockPadded *buffer_locks;
/*
- * Locks to protect the in memory buffer slot access in per SLRU bank. The
- * buffer_locks protects the I/O on each buffer slots whereas this lock
- * protect the in memory operation on the buffer within one SLRU bank.
+ * This contains nslots buffer locks followed by nbanks bank locks.
+ * The buffer locks protect the I/O on each buffer slot, whereas the
+ * bank locks protect the in-memory operations on the buffers within
+ * one SLRU bank.
*/
- LWLockPadded *bank_locks;
+ LWLockPadded *locks;
/*----------
* Instead of global counter we maintain a bank-wise lru counter because
@@ -169,9 +169,10 @@ typedef SlruCtlData *SlruCtl;
static inline LWLock *
SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno)
{
- int bankno = (pageno & ctl->bank_mask);
+ int banklockoffset =
+ ctl->shared->num_slots + (pageno & ctl->bank_mask);
- return &(ctl->shared->bank_locks[bankno].lock);
+ return &(ctl->shared->locks[banklockoffset].lock);
}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
--
2.39.2 (Apple Git-143)
v3-0002-Divide-SLRU-buffers-into-banks.patch (application/octet-stream)
From 0fbd91533ad3f1ee3a4931aafeb7b9aebf40d839 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 16:51:34 +0530
Subject: [PATCH v3 2/5] Divide SLRU buffers into banks
We want to eliminate linear search within SLRU buffers.
To do so we divide SLRU buffers into banks. Each bank holds
approximately 8 buffers. Each SLRU pageno may reside only in one bank.
Adjacent pagenos reside in different banks.
Andrey M. Borodin, with some modifications by Dilip Kumar
based on feedback by Alvaro Herrera.
---
src/backend/access/transam/slru.c | 73 +++++++++++++++++++++++++++++--
src/include/access/slru.h | 6 +++
2 files changed, 75 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9ed24e1185..c339e0a7e4 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "port/pg_bitutils.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -71,6 +72,18 @@
*/
#define MAX_WRITEALL_BUFFERS 16
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
+
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_MIN_BANK_SIZE 8
+#define SLRU_MAX_BANKS 128
+
typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
@@ -134,7 +147,6 @@ typedef enum
static SlruErrorCause slru_errcause;
static int slru_errno;
-
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -147,6 +159,7 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static void SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask);
/*
* Initialization of shared memory
@@ -156,6 +169,10 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
+ int bankmask_ignore;
+ int banksize_ignore;
+
+ SlruAdjustNSlots(&nslots, &banksize_ignore, &bankmask_ignore);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -191,6 +208,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
{
SlruShared shared;
bool found;
+ int bankmask;
+ int banksize;
+
+ SlruAdjustNSlots(&nslots, &banksize, &bankmask);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -258,7 +279,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
else
+ {
Assert(found);
+ Assert(shared->num_slots == nslots);
+ }
/*
* Initialize the unshared control struct, including directory path. We
@@ -266,6 +290,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
*/
ctl->shared = shared;
ctl->sync_handler = sync_handler;
+ ctl->bank_size = banksize;
+ ctl->bank_mask = bankmask;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -497,12 +523,14 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int bankstart = (pageno & ctl->bank_mask) * ctl->bank_size;
+ int bankend = bankstart + ctl->bank_size;
/* Try to find the page while holding only shared lock */
LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
@@ -1031,7 +1059,10 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ int bankstart = (pageno & ctl->bank_mask) * ctl->bank_size;
+ int bankend = bankstart + ctl->bank_size;
+
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
@@ -1066,7 +1097,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
int this_page_number;
@@ -1613,3 +1644,37 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
errno = save_errno;
return result;
}
+
+/*
+ * Pick bank size optimal for N-associative SLRU buffers.
+ *
+ * We expect the bank number to be picked from the lowest bits of the requested
+ * pageno. Thus we want the number of banks to be a power of 2.
+ */
+static void
+SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask)
+{
+ int nbanks = 1;
+
+ *nslots = (int) pg_nextpower2_32(Max(SLRU_MIN_BANK_SIZE, *nslots));
+ *banksize = *nslots;
+
+ /*
+ * Adjust the number of banks and per bank size. Start with one bank, then
+ * double it until we reach SLRU_MAX_BANKS, and the bank size exceeds
+ * SLRU_MIN_BANK_SIZE. By doing so, we will ensure we don't have too many
+ * banks, but also that we don't have very large banks.
+ */
+ while (nbanks < SLRU_MAX_BANKS && *banksize > SLRU_MIN_BANK_SIZE)
+ {
+ if ((*banksize & 1) != 0)
+ *banksize += 1;
+ *banksize /= 2;
+ nbanks *= 2;
+ }
+
+ elog(DEBUG5, "nslots %d banksize %d nbanks %d ", *nslots, *banksize, nbanks);
+
+ *nslots = *banksize * nbanks;
+ *bankmask = (*nslots / *banksize) - 1;
+}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index c0d37e3eb3..c3fd58185a 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -139,6 +139,12 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /*
+ * mask and size for slotno banks
+ */
+ int bank_size;
+ Size bank_mask;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
--
2.39.2 (Apple Git-143)
Attachment: v3-0004-Introduce-bank-wise-LRU-counter.patch (application/octet-stream)
From 2ea0f9c9dad8482275eab2e77cc4d128ba2d5196 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Sat, 28 Oct 2023 13:48:44 +0530
Subject: [PATCH v3 4/5] Introduce bank-wise LRU counter
Since we have already divided the buffer pool into banks, and the
victim buffer search is done at the bank level, there is no need
for a centralized LRU counter. This also improves performance by
avoiding the frequent CPU cache invalidation caused by updating a
single shared variable.
Dilip Kumar based on design idea from Robert Haas
---
src/backend/access/transam/slru.c | 83 +++++++++++++++++--------------
src/include/access/slru.h | 28 +++++++----
2 files changed, 64 insertions(+), 47 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index cf215627ea..6c8c21f215 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -105,34 +105,6 @@ typedef struct SlruWriteAllData *SlruWriteAll;
(a).segno = (xx_segno) \
)
-/*
- * Macro to mark a buffer slot "most recently used". Note multiple evaluation
- * of arguments!
- *
- * The reason for the if-test is that there are often many consecutive
- * accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
- * pages' counts will "wrap around" and make them appear recently used.
- *
- * We allow this code to be executed concurrently by multiple processes within
- * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
- * this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
- * page_lru_count entries to be "reset" to lower values than they should have,
- * in case a process is delayed while it executes this macro. With care in
- * SlruSelectLRUPage(), this does little harm, and in any case the absolute
- * worst possible consequence is a nonoptimal choice of page to evict. The
- * gain from allowing concurrent reads of SLRU pages seems worth it.
- */
-#define SlruRecentlyUsed(shared, slotno) \
- do { \
- int new_lru_count = (shared)->cur_lru_count; \
- if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
- (shared)->page_lru_count[slotno] = new_lru_count; \
- } \
- } while (0)
-
/* Saved info for SlruReportIOError */
typedef enum
{
@@ -159,6 +131,8 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static inline void SlruRecentlyUsed(SlruShared shared, int slotno,
+ int banksize);
static int SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask);
/*
@@ -184,6 +158,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
sz += MAXALIGN(nbanks * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN(nbanks * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -236,8 +211,6 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -260,6 +233,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
shared->bank_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
+ shared->bank_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(int));
if (nlsns > 0)
{
@@ -281,8 +256,11 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
}
/* Initialize bank locks for each buffer bank. */
for (bankno = 0; bankno < nbanks; bankno++)
+ {
LWLockInitialize(&shared->bank_locks[bankno].lock,
bank_tranche_id);
+ shared->bank_cur_lru_count[bankno] = 0;
+ }
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -329,7 +307,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->bank_size);
/* Set the buffer to zeroes */
MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
@@ -461,7 +439,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
continue;
}
/* Otherwise, it's ready to use */
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->bank_size);
/* update the stats counter of pages found in the SLRU */
pgstat_count_slru_page_hit(shared->slru_stats_idx);
@@ -507,7 +485,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
if (!ok)
SlruReportIOError(ctl, pageno, xid);
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->bank_size);
/* update the stats counter of pages not found in SLRU */
pgstat_count_slru_page_read(shared->slru_stats_idx);
@@ -550,7 +528,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
{
/* See comments for SlruRecentlyUsed macro */
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->bank_size);
/* update the stats counter of pages found in the SLRU */
pgstat_count_slru_page_hit(shared->slru_stats_idx);
@@ -1073,7 +1051,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- int bankstart = (pageno & ctl->bank_mask) * ctl->bank_size;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * ctl->bank_size;
int bankend = bankstart + ctl->bank_size;
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -1110,7 +1089,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
+ cur_count = (shared->bank_cur_lru_count[bankno])++;
for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
@@ -1701,6 +1680,38 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
return result;
}
+/*
+ * Mark a buffer slot "most recently used".
+ *
+ * The reason for the if-test is that there are often many consecutive
+ * accesses to the same page (particularly the latest page). By suppressing
+ * useless increments of bank_cur_lru_count, we reduce the probability that old
+ * pages' counts will "wrap around" and make them appear recently used.
+ *
+ * We allow this code to be executed concurrently by multiple processes within
+ * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
+ * this should not cause any completely-bogus values to enter the computation.
+ * However, it is possible for either bank_cur_lru_count or individual
+ * page_lru_count entries to be "reset" to lower values than they should have,
+ * in case a process is delayed while it executes this function. With care in
+ * SlruSelectLRUPage(), this does little harm, and in any case the absolute
+ * worst possible consequence is a nonoptimal choice of page to evict. The
+ * gain from allowing concurrent reads of SLRU pages seems worth it.
+ */
+static inline void
+SlruRecentlyUsed(SlruShared shared, int slotno, int banksize)
+{
+ int slrubankno = slotno / banksize;
+ int new_lru_count = shared->bank_cur_lru_count[slrubankno];
+
+ if (new_lru_count != shared->page_lru_count[slotno])
+ {
+ shared->bank_cur_lru_count[slrubankno] = ++new_lru_count;
+ shared->page_lru_count[slotno] = new_lru_count;
+ }
+}
+
/*
+ * Pick bank size optimal for N-associative SLRU buffers.
*
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index f3545d5f5d..a18b07f5d0 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -78,6 +78,23 @@ typedef struct SlruSharedData
*/
LWLockPadded *bank_locks;
+ /*----------
+ * Instead of a global counter we maintain a bank-wise LRU counter because
+ * a) victim buffer selection is done at the bank level, so there is no
+ * point in having a global counter, and b) manipulating a global counter
+ * causes frequent CPU cache invalidation, which hurts performance.
+ *
+ * We mark a page "most recently used" by setting
+ * page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
+ * The oldest page is therefore the one with the highest value of
+ * bank_cur_lru_count[bankno] - page_lru_count[slotno]
+ * The counts will eventually wrap around, but this calculation still
+ * works as long as no page's age exceeds INT_MAX counts.
+ *----------
+ */
+ int *bank_cur_lru_count;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -89,17 +106,6 @@ typedef struct SlruSharedData
XLogRecPtr *group_lsn;
int lsn_groups_per_page;
- /*----------
- * We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
- * The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
- * The counts will eventually wrap around, but this calculation still
- * works as long as no page's age exceeds INT_MAX counts.
- *----------
- */
- int cur_lru_count;
-
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
--
2.39.2 (Apple Git-143)
Attachment: v3-0001-Make-all-SLRU-buffer-sizes-configurable.patch (application/octet-stream)
From c5d594053a2ad3056bde425bd52f589e3c102e02 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 14:45:00 +0530
Subject: [PATCH v3 1/5] Make all SLRU buffer sizes configurable.
Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.
Remove the limits on xact_buffers and commit_ts_buffers. The default
sizes for those caches are ~0.2% and ~0.1% of shared_buffers, as before,
but now there is no cap at 128 and 16 buffers respectively (unless
track_commit_timestamp is disabled, in which case we might as well keep
it tiny).
shown to be useful on modern systems, and an earlier commit replaced a
linear search with a hash table to avoid problems with extreme cases.
Patch by Andrey M. Borodin, with some bug fixes by Dilip Kumar.
Reviewed by Anastasia Lubennikova, Tomas Vondra, Alexander Korotkov,
Gilles Darold, Thomas Munro and Dilip Kumar
---
doc/src/sgml/config.sgml | 135 ++++++++++++++++++
src/backend/access/transam/clog.c | 23 ++-
src/backend/access/transam/commit_ts.c | 5 +
src/backend/access/transam/multixact.c | 8 +-
src/backend/access/transam/subtrans.c | 5 +-
src/backend/commands/async.c | 8 +-
src/backend/commands/variable.c | 19 +++
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/utils/init/globals.c | 8 ++
src/backend/utils/misc/guc_tables.c | 77 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 9 ++
src/include/access/clog.h | 10 ++
src/include/access/commit_ts.h | 1 -
src/include/access/multixact.h | 4 -
src/include/access/slru.h | 5 +
src/include/access/subtrans.h | 3 -
src/include/commands/async.h | 5 -
src/include/miscadmin.h | 7 +
src/include/storage/predicate.h | 4 -
src/include/utils/guc_hooks.h | 2 +
20 files changed, 298 insertions(+), 44 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 985cabfc0b..0584bcdc51 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,141 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-offsets-buffers" xreflabel="multixact_offsets_buffers">
+ <term><varname>multixact_offsets_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/offsets</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>8</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-buffers" xreflabel="multixact_members_buffers">
+ <term><varname>multixact_members_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/members</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>16</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-subtrans-buffers" xreflabel="subtrans_buffers">
+ <term><varname>subtrans_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>subtrans_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_subtrans</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+        The default value is <literal>32</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-notify-buffers" xreflabel="notify_buffers">
+ <term><varname>notify_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_notify</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>8</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-serial-buffers" xreflabel="serial_buffers">
+ <term><varname>serial_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>serial_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_serial</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>16</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-xact-buffers" xreflabel="xact_buffers">
+ <term><varname>xact_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>xact_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_xact</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 512, but not fewer than 4 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-commit-ts-buffers" xreflabel="commit_ts_buffers">
+ <term><varname>commit_ts_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>commit_ts_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+        Specifies the amount of shared memory to use to cache the contents of
+ <literal>pg_commit_ts</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 1024, but not fewer than 4 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..6ef9aacb0e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -58,8 +58,8 @@
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
-#define CLOG_XACTS_PER_BYTE 4
-#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+StaticAssertDecl((CLOG_BITS_PER_XACT * CLOG_XACTS_PER_BYTE) == BITS_PER_BYTE,
+ "CLOG_BITS_PER_XACT and CLOG_XACTS_PER_BYTE are inconsistent");
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
@@ -663,23 +663,16 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
+ * By default, we'll use 2MB for every 1GB of shared buffers, up to the
+ * theoretical maximum useful value, but always at least 4 buffers.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(128, Max(4, NBuffers / 512));
+ /* Use configured value if provided. */
+ if (xact_buffers > 0)
+ return Max(4, xact_buffers);
+ return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 512));
}
/*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..48826672ea 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -493,10 +493,15 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
* We use a very similar logic as for the number of CLOG buffers (except we
* scale up twice as fast with shared buffers, and the maximum is twice as
* high); see comments in CLOGShmemBuffers.
+ * By default, we'll use 1MB for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 4 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(4, commit_ts_buffers);
return Min(256, Max(4, NBuffers / 256));
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57ed34c0a8..62709fcd07 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1834,8 +1834,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_buffers, 0));
return size;
}
@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_members_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..0dd48f40f3 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
@@ -184,14 +185,14 @@ SubTransGetTopmostTransaction(TransactionId xid)
Size
SUBTRANSShmemSize(void)
{
- return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
+ return SimpleLruShmemSize(subtrans_buffers, 0);
}
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
- SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
+ SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
SubtransSLRULock, "pg_subtrans",
LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 38ddae08b8..4bdbbe5cc0 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -117,7 +117,7 @@
* frontend during startup.) The above design guarantees that notifies from
* other backends will never be missed by ignoring self-notifies.
*
- * The amount of shared memory used for notify management (NUM_NOTIFY_BUFFERS)
+ * The amount of shared memory used for notify management (notify_buffers)
* can be varied without affecting anything but performance. The maximum
* amount of notification data that can be queued at one time is determined
* by slru.c's wraparound limit; see QUEUE_MAX_PAGE below.
@@ -235,7 +235,7 @@ typedef struct QueuePosition
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
- * should likely be less than NUM_NOTIFY_BUFFERS, to ensure that backends
+ * should likely be less than notify_buffers, to ensure that backends
* catch up before the pages they'll need to read fall out of SLRU cache.
*/
#define QUEUE_CLEANUP_DELAY 4
@@ -521,7 +521,7 @@ AsyncShmemSize(void)
size = mul_size(MaxBackends + 1, sizeof(QueueBackendStatus));
size = add_size(size, offsetof(AsyncQueueControl, backend));
- size = add_size(size, SimpleLruShmemSize(NUM_NOTIFY_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
return size;
}
@@ -569,7 +569,7 @@ AsyncShmemInit(void)
* Set up SLRU management of the pg_notify data.
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
- SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
+ SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
SYNC_HANDLER_NONE);
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index a88cf5f118..ee25aa0656 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -18,6 +18,8 @@
#include <ctype.h>
+#include "access/clog.h"
+#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/xact.h"
@@ -400,6 +402,23 @@ show_timezone(void)
return "unknown";
}
+const char *
+show_xact_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CLOGShmemBuffers());
+ return nbuf;
+}
+
+const char *
+show_commit_ts_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CommitTsShmemBuffers());
+ return nbuf;
+}
/*
* LOG_TIMEZONE
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a794546db3..18ea18316d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,7 +808,7 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
+ serial_buffers, 0, SerialSLRULock, "pg_serial",
LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
@@ -1347,7 +1347,7 @@ PredicateLockShmemSize(void)
/* Shared memory structures for SLRU tracking of old committed xids. */
size = add_size(size, sizeof(SerialControlData));
- size = add_size(size, SimpleLruShmemSize(NUM_SERIAL_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(serial_buffers, 0));
return size;
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 60bc1217fb..82acdf4226 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -156,3 +156,11 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int multixact_offsets_buffers = 8;
+int multixact_members_buffers = 16;
+int subtrans_buffers = 32;
+int notify_buffers = 8;
+int serial_buffers = 16;
+int xact_buffers = 0;
+int commit_ts_buffers = 0;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..83acff7037 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
#include "access/xlog_internal.h"
@@ -2287,6 +2288,82 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact offset SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 8, 2, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact member SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_buffers,
+ 16, 2, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"subtrans_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the sub-transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &subtrans_buffers,
+ 32, 2, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+ {
+ {"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the NOTIFY message SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ ¬ify_buffers,
+ 8, 2, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"serial_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the serializable transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &serial_buffers,
+ 16, 2, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"xact_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the transaction status SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &xact_buffers,
+ 0, 0, CLOG_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_xact_buffers
+ },
+
+ {
+ {"commit_ts_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the commit timestamp SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &commit_ts_buffers,
+ 0, 0, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_commit_ts_buffers
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..c21d6468ed 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -50,6 +50,15 @@
#external_pid_file = '' # write an extra PID file
# (change requires restart)
+# - SLRU Buffers (change requires restart) -
+
+#xact_buffers = 0 # memory for pg_xact (0 = auto)
+#subtrans_buffers = 32 # memory for pg_subtrans
+#multixact_offsets_buffers = 8 # memory for pg_multixact/offsets
+#multixact_members_buffers = 16 # memory for pg_multixact/members
+#notify_buffers = 8 # memory for pg_notify
+#serial_buffers = 16 # memory for pg_serial
+#commit_ts_buffers = 0 # memory for pg_commit_ts (0 = auto)
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index d99444f073..a9cd65db36 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -15,6 +15,16 @@
#include "storage/sync.h"
#include "lib/stringinfo.h"
+/*
+ * Don't allow xact_buffers to be set higher than could possibly be useful,
+ * or higher than SLRU would allow.
+ */
+#define CLOG_XACTS_PER_BYTE 4
+#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+#define CLOG_MAX_ALLOWED_BUFFERS \
+ Min(SLRU_MAX_ALLOWED_BUFFERS, \
+ (((MaxTransactionId / 2) + (CLOG_XACTS_PER_PAGE - 1)) / CLOG_XACTS_PER_PAGE))
+
/*
* Possible transaction statuses --- note that all-zeroes is the initial
* state.
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
#include "replication/origin.h"
#include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 0be1355892..18d7ba4ca9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -29,10 +29,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 552cc19e68..c0d37e3eb3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
/*
* Define SLRU segment size. A page is the same BLCKSZ as is used everywhere
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..147dc4acc3 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -11,9 +11,6 @@
#ifndef SUBTRANS_H
#define SUBTRANS_H
-/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
-
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
extern TransactionId SubTransGetTopmostTransaction(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b3e6815ee4 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -15,11 +15,6 @@
#include <signal.h>
-/*
- * The number of SLRU page buffers we use for the notification queue.
- */
-#define NUM_NOTIFY_BUFFERS 8
-
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..e2473f41de 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,13 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_buffers;
+extern PGDLLIMPORT int multixact_members_buffers;
+extern PGDLLIMPORT int subtrans_buffers;
+extern PGDLLIMPORT int notify_buffers;
+extern PGDLLIMPORT int serial_buffers;
+extern PGDLLIMPORT int xact_buffers;
+extern PGDLLIMPORT int commit_ts_buffers;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..7b68c8f1c7 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -26,10 +26,6 @@ extern PGDLLIMPORT int max_predicate_locks_per_xact;
extern PGDLLIMPORT int max_predicate_locks_per_relation;
extern PGDLLIMPORT int max_predicate_locks_per_page;
-
-/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
-
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
* in a parallel query.
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2a191830a8..8597e430de 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -161,4 +161,6 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern bool check_wal_segment_size(int *newval, void **extra, GucSource source);
extern void assign_wal_sync_method(int new_wal_sync_method, void *extra);
+extern const char *show_xact_buffers(void);
+extern const char *show_commit_ts_buffers(void);
#endif /* GUC_HOOKS_H */
--
2.39.2 (Apple Git-143)
Attachment: v3-0003-Bank-wise-slru-locks.patch (application/octet-stream)
From 6b2f662dfe0794dce613c33a21f8f740cf8229e3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 30 Oct 2023 11:06:12 +0530
Subject: [PATCH v3 3/5] Bank wise slru locks
The previous patch divided the SLRU buffer pool into associative
banks. This patch optimizes it further by introducing bank-wise SLRU
locks instead of a common centralized lock, which reduces contention
on the SLRU control lock.

Dilip Kumar, with design input from Robert Haas
and review by Alvaro Herrera
---
src/backend/access/transam/clog.c | 114 ++++++++++-----
src/backend/access/transam/commit_ts.c | 43 +++---
src/backend/access/transam/multixact.c | 177 ++++++++++++++++-------
src/backend/access/transam/slru.c | 148 +++++++++++++++----
src/backend/access/transam/subtrans.c | 58 ++++++--
src/backend/commands/async.c | 32 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 33 +++--
src/include/access/slru.h | 32 +++-
src/include/storage/lwlock.h | 7 +
src/test/modules/test_slru/test_slru.c | 32 ++--
12 files changed, 494 insertions(+), 210 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 6ef9aacb0e..830d8bcdf5 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -274,14 +274,19 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the SLRU bank lock w.r.t. the page we are going to access. */
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on SLRU lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
+ * updates for multiple backends so that the number of times the SLRU lock
* needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
@@ -300,17 +305,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire SLRU lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -323,10 +328,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -345,7 +350,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruGetSLRUBankLock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -396,14 +402,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire SLRU bank lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * the lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention when many
+ * processes are trying to commit at once, since the lock need not be
+ * repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -417,6 +422,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -497,13 +504,10 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
- * group XID status update, saving a pointer to the head of the list.
- * Trying to pop elements one at a time could lead to an ABA problem.
+ * We are the leader, so clear the list of processes waiting for group
+ * XID status update, saving a pointer to the head of the list. Trying to
+ * pop elements one at a time could lead to an ABA problem.
*/
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
INVALID_PGPROCNO);
@@ -511,10 +515,38 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /* Acquire the SLRU bank lock w.r.t. the first page in the group. */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruGetSLRUBankLock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * Although we try our best to keep all members of a group on the same
+ * page, there are cases where the group can contain different pages;
+ * for details refer to the comment in the while loop above where we
+ * add this process to the group. So if the page we are about to
+ * access does not fall in the same SLRU bank as the last page we
+ * updated, we need to release the lock on the previous bank and
+ * acquire the lock on the bank of the page we are updating now.
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -534,7 +566,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -563,10 +596,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with the slot-specific SLRU bank's lock held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -655,7 +689,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(XactCtl, pageno));
return status;
}
@@ -689,8 +723,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
- SYNC_HANDLER_CLOG);
+ "pg_xact", LWTRANCHE_XACT_BUFFER,
+ LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -704,8 +738,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -714,7 +749,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -749,14 +784,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -767,8 +798,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -800,7 +832,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -832,6 +864,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -842,13 +875,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -986,16 +1020,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 48826672ea..204341da53 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -218,8 +218,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -229,13 +230,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the slot-specific SLRU bank's lock held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -336,7 +337,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -526,9 +527,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -684,9 +684,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -713,12 +711,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -767,9 +766,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruReleaseAllBankLock(CommitTsCtl);
}
/*
@@ -801,6 +800,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -821,12 +821,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -980,16 +982,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -1001,7 +1005,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 62709fcd07..3284900e02 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -192,10 +192,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the bank-wise SLRU locks of
+ * MultiXactOffset and MultiXactMember to guard accesses to the two sets of
+ * SLRU buffers. For concurrency's sake, we avoid holding more than one of
+ * these locks at a time.)
*/
typedef struct MultiXactStateData
{
@@ -870,12 +870,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -891,10 +894,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -916,6 +917,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * MultiXactMember SLRU page is changed so check if this new page
+ * fall into the different SLRU bank then release the old bank's
+ * lock and acquire lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -936,7 +951,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1239,6 +1255,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1342,11 +1360,23 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If the page is on the different SLRU bank then release the lock on the
+ * previous bank if we are already holding one and acquire the lock on the
+ * new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1379,7 +1409,22 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * SLRU pageno is changed so check whether this page is falling in
+ * the different slru bank than on which we are already holding
+ * the lock and if so release the lock on the old bank and acquire
+ * that on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1388,7 +1433,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1397,13 +1443,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1419,6 +1463,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU bank; if so, release the
+ * old bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1442,7 +1500,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1852,14 +1911,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", multixact_offsets_buffers, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", multixact_members_buffers, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
@@ -1894,8 +1953,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1904,9 +1965,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1915,7 +1977,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1975,10 +2037,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1993,7 +2057,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2015,13 +2079,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2046,13 +2112,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2067,7 +2133,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2075,18 +2143,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2098,7 +2165,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2113,10 +2182,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2404,6 +2472,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2414,13 +2483,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2453,15 +2523,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2759,7 +2831,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3241,31 +3313,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3331,7 +3405,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index c339e0a7e4..cf215627ea 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -159,7 +159,7 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
-static void SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask);
+static int SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask);
/*
* Initialization of shared memory
@@ -171,8 +171,9 @@ SimpleLruShmemSize(int nslots, int nlsns)
Size sz;
int bankmask_ignore;
int banksize_ignore;
+ int nbanks;
- SlruAdjustNSlots(&nslots, &banksize_ignore, &bankmask_ignore);
+ nbanks = SlruAdjustNSlots(&nslots, &banksize_ignore, &bankmask_ignore);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -182,6 +183,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN(nbanks * sizeof(LWLockPadded)); /* bank_locks[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -198,20 +200,22 @@ SimpleLruShmemSize(int nslots, int nlsns)
* nlsns: number of LSN groups per page (set to zero if not relevant).
* ctllock: LWLock to use to control access to the shared control structure.
* subdir: PGDATA-relative subdirectory that will contain the files.
- * tranche_id: LWLock tranche ID to use for the SLRU's per-buffer LWLocks.
+ * buffer_tranche_id: tranche ID to use for the SLRU's per-buffer LWLocks.
+ * bank_tranche_id: tranche ID to use for the SLRU's per-bank LWLocks.
* sync_handler: which set of functions to use to handle sync requests
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int buffer_tranche_id, int bank_tranche_id,
SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
int bankmask;
int banksize;
+ int nbanks;
- SlruAdjustNSlots(&nslots, &banksize, &bankmask);
+ nbanks = SlruAdjustNSlots(&nslots, &banksize, &bankmask);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -223,13 +227,12 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int bankno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
@@ -255,6 +258,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->bank_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(LWLockPadded));
if (nlsns > 0)
{
@@ -266,7 +271,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
for (slotno = 0; slotno < nslots; slotno++)
{
LWLockInitialize(&shared->buffer_locks[slotno].lock,
- tranche_id);
+ buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -274,6 +279,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->page_lru_count[slotno] = 0;
ptr += BLCKSZ;
}
+ /* Initialize bank locks for each buffer bank. */
+ for (bankno = 0; bankno < nbanks; bankno++)
+ LWLockInitialize(&shared->bank_locks[bankno].lock,
+ bank_tranche_id);
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -329,7 +338,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -368,12 +377,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int bankno = slotno / ctl->bank_size;
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -428,6 +438,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int bankno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -470,9 +481,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ bankno = slotno / ctl->bank_size;
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -481,7 +493,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -523,11 +535,12 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
- int bankstart = (pageno & ctl->bank_mask) * ctl->bank_size;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * ctl->bank_size;
int bankend = bankstart + ctl->bank_size;
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -547,8 +560,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -570,6 +583,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int bankno = slotno / ctl->bank_size;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -598,7 +612,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[bankno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -613,7 +627,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1118,7 +1132,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1192,6 +1206,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int lastbankno = 0;
bool ok;
/* update the stats counter of flushes */
@@ -1202,10 +1217,19 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[0].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / ctl->bank_size;
+
+ if (curbankno != lastbankno)
+ {
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ lastbankno = curbankno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1219,7 +1243,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[lastbankno].lock);
/*
* Now close any files that were open
@@ -1259,6 +1283,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevbankno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1269,25 +1294,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevbankno = 0;
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curbankno = slotno / ctl->bank_size;
+
+ /*
+	 * If curbankno differs from prevbankno, release the lock on prevbankno
+	 * and acquire the lock on curbankno.
+ */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1317,10 +1355,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1361,15 +1401,31 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevbankno = 0;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevbankno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curbankno;
+
+ curbankno = slotno / ctl->bank_size;
+
+ /*
+	 * If curbankno differs from prevbankno, release the lock on prevbankno
+	 * and acquire the lock on curbankno.
+ */
+ if (curbankno != prevbankno)
+ {
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
+ LWLockAcquire(&shared->bank_locks[curbankno].lock, LW_EXCLUSIVE);
+ prevbankno = curbankno;
+ }
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1403,7 +1459,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevbankno].lock);
}
/*
@@ -1651,7 +1707,7 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
* We expect the bank number to be picked from the lowest bits of the requested
* pageno. Thus we want the number of banks to be the power of 2.
*/
-static void
+static int
SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask)
{
int nbanks = 1;
@@ -1677,4 +1733,34 @@ SlruAdjustNSlots(int *nslots, int *banksize, int *bankmask)
*nslots = *banksize * nbanks;
*bankmask = (*nslots / *banksize) - 1;
+
+ return nbanks;
+}
+
+/*
+ * Acquire all bank locks of the given SlruCtl, in ascending bank order.
+ */
+void
+SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+ int nbanks = shared->num_slots / ctl->bank_size;
+
+ for (bankno = 0; bankno < nbanks; bankno++)
+ LWLockAcquire(&shared->bank_locks[bankno].lock, mode);
+}
+
+/*
+ * Release all bank locks of the given SlruCtl.
+ */
+void
+SimpleLruReleaseAllBankLock(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int bankno;
+ int nbanks = shared->num_slots / ctl->bank_size;
+
+ for (bankno = 0; bankno < nbanks; bankno++)
+ LWLockRelease(&shared->bank_locks[bankno].lock);
}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0dd48f40f3..4e3fc5fc51 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -77,12 +77,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -100,7 +102,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -130,7 +132,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SubTransCtl, pageno));
return parent;
}
@@ -193,8 +195,9 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU,
+ SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -212,8 +215,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -222,7 +226,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -252,6 +256,8 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
FullTransactionId nextXid;
int startPage;
int endPage;
+ LWLock *prevlock;
+ LWLock *lock;
/*
* Since we don't expect pg_subtrans to be valid across crashes, we
@@ -259,23 +265,47 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
-
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+ prevlock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
while (startPage != endPage)
{
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+	 * If this page lives in a different bank, release the lock on the old
+	 * bank and acquire the lock on the new one.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
(void) ZeroSUBTRANSPage(startPage);
startPage++;
/* must account for wraparound */
if (startPage > TransactionIdToPage(MaxTransactionId))
startPage = 0;
}
- (void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+	 * If this page lives in a different bank, release the lock on the old
+	 * bank and acquire the lock on the new one.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ (void) ZeroSUBTRANSPage(startPage);
+ LWLockRelease(lock);
}
/*
@@ -309,6 +339,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -320,12 +351,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bdbbe5cc0..9f14faed78 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -267,9 +267,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * The SLRU buffer pool is divided into banks, and the per-bank SLRU lock is
+ * used as the control lock for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly the SLRU bank lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -570,7 +571,7 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER, LWTRANCHE_NOTIFY_SLRU,
SYNC_HANDLER_NONE);
if (!found)
@@ -1402,7 +1403,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * the page-specific SLRU bank lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1412,9 +1413,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1438,6 +1437,11 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+
+ /* We hold both NotifyQueueLock and SLRU bank lock during this operation */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1509,7 +1513,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(lock);
return nextNotify;
}
@@ -1988,9 +1992,9 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
- * and possibly transmitting them to our frontend. Copy only the
- * part of the page we will actually inspect.
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
InvalidTransactionId);
@@ -2010,7 +2014,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2051,7 +2055,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU bank lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2204,7 +2208,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU bank locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+# 26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+# 38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 18ea18316d..4098a056e5 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,8 +808,9 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- serial_buffers, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ serial_buffers, 0, "pg_serial",
+ LWTRANCHE_SERIAL_BUFFER, LWTRANCHE_SERIAL_SLRU,
+ SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -846,12 +847,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruGetSLRUBankLock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -901,7 +904,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -919,10 +922,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -934,13 +937,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
+ * The following function must be called without holding the SLRU bank lock,
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -953,7 +956,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -965,7 +968,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -983,7 +986,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -992,7 +995,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1006,12 +1009,12 @@ CheckPointPredicate(void)
{
int truncateCutoffPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1071,7 +1074,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, truncateCutoffPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index c3fd58185a..f3545d5f5d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -57,8 +57,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -73,6 +71,13 @@ typedef struct SlruSharedData
int *page_lru_count;
LWLockPadded *buffer_locks;
+ /*
+ * Locks to protect in-memory buffer slot access, one lock per SLRU bank.
+ * The buffer_locks protect the I/O on each buffer slot, whereas these
+ * locks protect the in-memory operations on the buffers within one bank.
+ */
+ LWLockPadded *bank_locks;
+
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
@@ -100,7 +105,7 @@ typedef struct SlruSharedData
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -149,11 +154,24 @@ typedef struct SlruCtlData
typedef SlruCtlData *SlruCtl;
+/*
+ * Get the SLRU bank lock for the given SlruCtl and pageno.
+ *
+ * This lock needs to be acquired in order to access the SLRU buffer slots
+ * in the respective bank. For more details, refer to the comments in
+ * SlruSharedData.
+ */
+static inline LWLock *
+SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno)
+{
+ int bankno = (pageno & ctl->bank_mask);
+
+ return &(ctl->shared->bank_locks[bankno].lock);
+}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int buffer_tranche_id,
+ int bank_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -181,5 +199,7 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern LWLock *SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno);
+extern void SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruReleaseAllBankLock(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b038e599c0..87cb812b84 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,13 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..9a02f33933 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -40,10 +40,6 @@ PG_FUNCTION_INFO_V1(test_slru_delete_all);
/* Number of SLRU page slots */
#define NUM_TEST_BUFFERS 16
-/* SLRU control lock */
-LWLock TestSLRULock;
-#define TestSLRULock (&TestSLRULock)
-
static SlruCtlData TestSlruCtlData;
#define TestSlruCtl (&TestSlruCtlData)
@@ -63,9 +59,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +76,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +95,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +113,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +131,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +214,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +228,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
+ LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Based on some offlist discussions with Alvaro and Robert in separate
conversations, Alvaro and I arrived at the same concern: if a user sets
a very high value for the number of slots (say 1GB worth), then each
bank will hold 1024 slots (assuming a maximum of 128 banks), and
continuing to use a sequential search to find the buffer for a page
could be costly in such cases. But later, in one of the conversations
with Robert, I realized that we can combine this bank-wise lock
approach with a partitioned hash table.
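A toy model of the combined scheme (the constants and function names here are illustrative assumptions, not taken from the patches; the actual patch derives the bank number with `pageno & bank_mask`):

```c
#include <assert.h>

/*
 * Illustrative constants only (the real values come from GUCs and the
 * patches themselves): the mapping hash table has a fixed number of
 * partitions, and the buffer pool is split into banks of equal size.
 */
#define SLRU_NUM_PARTITIONS 8   /* power of two, so a mask works */
#define SLRU_BANK_SIZE      16  /* buffer slots per bank */

/* Which hash partition (and hence which lock) covers this page. */
static int
page_to_partition(int pageno)
{
    return pageno & (SLRU_NUM_PARTITIONS - 1);
}

/* Victim buffer search for this page is confined to one bank's slots. */
static int
bank_first_slot(int pageno)
{
    return page_to_partition(pageno) * SLRU_BANK_SIZE;
}
```

Two pages that map to different partitions are looked up and evicted under different locks, which is the whole point of removing the centralized control lock.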
So the idea is that we will use a buffer mapping hash table, similar to
what Thomas used in one of his patches [1], but instead of a normal hash
table we will use a partitioned hash table. The SLRU buffer pool is
still divided as in the bank-wise approach, with a separate lock for
each slot range. So now we get the benefit of both approaches: 1) with
a mapping hash we avoid the sequential search; 2) by dividing the
buffer pool into banks and keeping the victim buffer search within a
single bank, we avoid locking all the partitions during the victim
buffer search; 3) we can also maintain a bank-wise LRU counter so that
we avoid contention on a single variable, as discussed in my first
email of this thread.
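Benefit 3, the bank-wise LRU counter, can be sketched as follows. Names and sizes here are made up for illustration, but the mechanism mirrors the description above: each bank stamps its pages from its own counter (so hot updates from different banks never touch the same variable), and the linear victim search never leaves one bank:

```c
#include <assert.h>

#define NUM_BANKS   2
#define BANK_SIZE   4
#define NUM_SLOTS   (NUM_BANKS * BANK_SIZE)

static int page_lru_count[NUM_SLOTS];
static int bank_cur_lru_count[NUM_BANKS];   /* one counter per bank */

/* Mark a slot recently used: stamp it with its own bank's counter. */
static void
slru_recently_used(int slotno)
{
    int bankno = slotno / BANK_SIZE;

    page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
}

/* Pick the least recently used slot, scanning only one bank's slots. */
static int
slru_pick_victim(int bankno)
{
    int best = bankno * BANK_SIZE;

    for (int slotno = best + 1; slotno < (bankno + 1) * BANK_SIZE; slotno++)
        if (page_lru_count[slotno] < page_lru_count[best])
            best = slotno;
    return best;
}
```

With a single global counter, every SlruRecentlyUsed() call from every backend would dirty the same cache line; with one counter per bank, only backends working on the same bank share it.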
Please find the updated patch set details and patches attached to the
email.
[1]: 0001-Make-all-SLRU-buffer-sizes-configurable: same patch as the previous patch set
[2]: 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce buffer mapping hash table
[3]: 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and LRU counter.
[4]: 0004-Merge-partition-locks-array-with-buffer-locks-array: merge the buffer locks and bank locks into the same array so that the bank-wise LRU counter does not fetch the next cache line in a hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
[5]: 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks
With this approach, I have also made the number of banks constant (i.e.
8) so that some of the computations are simpler. With a buffer mapping
hash table, keeping this fixed should not be a problem: even with a
very extreme configuration and a very high number of slots there is no
performance problem, because the buffer mapping hash avoids the
sequential search, and if the number of slots is set that high, victim
buffer searches should be infrequent, so the sequential search within a
bank for a victim buffer is not a concern either. I have also changed
the default number of slots to 64 and the minimum to 16. I think these
are reasonable defaults because the existing values are too low for
modern hardware, and since these parameters are configurable, a user
running with very little memory can still set them lower.
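The "multiple of the number of banks" rule guarantees that every bank owns exactly nslots / banks slots. A minimal sketch of that validation, assuming the fixed bank count of 8 and folding in the proposed minimum of 16 for illustration (the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed fixed bank count, per the description above. */
#define SLRU_NUM_BANKS 8

/*
 * Accept a slot count only if it is at least the proposed minimum and
 * divides evenly into banks, so every bank gets the same number of slots.
 */
static bool
slru_nslots_valid(int nslots)
{
    return nslots >= 16 && nslots % SLRU_NUM_BANKS == 0;
}
```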
[1]: /messages/by-id/CA+hUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-=pyg@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v4-0005-Ensure-slru-buffer-slots-are-in-multiple-of-numbe.patch
From 51be79b5c580a794760cf1baf4e040c55443adc6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Nov 2023 11:40:08 +0530
Subject: [PATCH v4 5/5] Ensure slru buffer slots are a multiple of the number
 of partitions
---
src/backend/access/transam/clog.c | 10 ++++++++++
src/backend/access/transam/commit_ts.c | 10 ++++++++++
src/backend/access/transam/multixact.c | 19 +++++++++++++++++++
src/backend/access/transam/slru.c | 18 ++++++++++++++++++
src/backend/access/transam/subtrans.c | 10 ++++++++++
src/backend/commands/async.c | 10 ++++++++++
src/backend/storage/lmgr/predicate.c | 10 ++++++++++
src/backend/utils/misc/guc_tables.c | 14 +++++++-------
src/include/access/slru.h | 1 +
src/include/utils/guc_hooks.h | 11 +++++++++++
10 files changed, 106 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index ab453cd171..17e08792d4 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -43,6 +43,7 @@
#include "pgstat.h"
#include "storage/proc.h"
#include "storage/sync.h"
+#include "utils/guc_hooks.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -1056,3 +1057,12 @@ clogsyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(XactCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for xact_buffers
+ */
+bool
+check_xact_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("xact_buffers", newval);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 58314e3885..4fd01c5ce8 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -33,6 +33,7 @@
#include "pg_trace.h"
#include "storage/shmem.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
@@ -1022,3 +1023,12 @@ committssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(CommitTsCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for commit_ts_buffers
+ */
+bool
+check_commit_ts_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("commit_ts_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index aa4f11fd3b..d0ce4e28d2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -88,6 +88,7 @@
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
@@ -3494,3 +3495,21 @@ multixactmemberssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for multixact_offsets_buffers
+ */
+bool
+check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_offsets_buffers", newval);
+}
+
+/*
+ * GUC check_hook for multixact_members_buffers
+ */
+bool
+check_multixact_members_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_members_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 8b89a86a10..bac6bf1d42 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/hsearch.h"
#define SlruFileName(ctl, path, seg) \
@@ -1850,3 +1851,20 @@ SimpleLruUnLockAllPartitions(SlruCtl ctl)
for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
LWLockRelease(&shared->locks[nslots + partno].lock);
}
+
+/*
+ * Helper function for GUC check_hook to check whether slru buffers are in
+ * multiples of SLRU_NUM_PARTITIONS.
+ */
+bool
+check_slru_buffers(const char *name, int *newval)
+{
+ /* The value must be a multiple of the number of SLRU partitions */
+ if (*newval % SLRU_NUM_PARTITIONS == 0)
+ return true;
+
+ GUC_check_errdetail("\"%s\" must be a multiple of %d", name,
+ SLRU_NUM_PARTITIONS);
+ return false;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index e4da6e28ae..16a26a2ca5 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -33,6 +33,7 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
@@ -406,3 +407,12 @@ SubTransPagePrecedes(int page1, int page2)
return (TransactionIdPrecedes(xid1, xid2) &&
TransactionIdPrecedes(xid1, xid2 + SUBTRANS_XACTS_PER_PAGE - 1));
}
+
+/*
+ * GUC check_hook for subtrans_buffers
+ */
+bool
+check_subtrans_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("subtrans_buffers", newval);
+}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 81fdca410b..0ea6880764 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -149,6 +149,7 @@
#include "storage/sinval.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -2462,3 +2463,12 @@ ClearPendingActionsAndNotifies(void)
pendingActions = NULL;
pendingNotifies = NULL;
}
+
+/*
+ * GUC check_hook for notify_buffers
+ */
+bool
+check_notify_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("notify_buffers", newval);
+}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6b7c1aa00e..40089a606d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -208,6 +208,7 @@
#include "storage/predicate_internals.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "utils/guc_hooks.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -5014,3 +5015,12 @@ AttachSerializableXact(SerializableXactHandle handle)
if (MySerializableXact != InvalidSerializableXact)
CreateLocalPredicateLockHash();
}
+
+/*
+ * GUC check_hook for serial_buffers
+ */
+bool
+check_serial_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("serial_buffers", newval);
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c82635943b..7c85d2126e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2296,7 +2296,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_offsets_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_offsets_buffers, NULL, NULL
},
{
@@ -2307,7 +2307,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_members_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_members_buffers, NULL, NULL
},
{
@@ -2318,7 +2318,7 @@ struct config_int ConfigureNamesInt[] =
},
&subtrans_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_subtrans_buffers, NULL, NULL
},
{
{"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
@@ -2328,7 +2328,7 @@ struct config_int ConfigureNamesInt[] =
},
¬ify_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_notify_buffers, NULL, NULL
},
{
@@ -2339,7 +2339,7 @@ struct config_int ConfigureNamesInt[] =
},
&serial_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_serial_buffers, NULL, NULL
},
{
@@ -2350,7 +2350,7 @@ struct config_int ConfigureNamesInt[] =
},
&xact_buffers,
64, 0, CLOG_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_xact_buffers
+ check_xact_buffers, NULL, show_xact_buffers
},
{
@@ -2361,7 +2361,7 @@ struct config_int ConfigureNamesInt[] =
},
&commit_ts_buffers,
64, 0, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_commit_ts_buffers
+ check_commit_ts_buffers, NULL, show_commit_ts_buffers
},
{
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index ac1227f29f..fef23d30f5 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -198,4 +198,5 @@ extern LWLock *SimpleLruGetPartitionLock(SlruCtl ctl, int pageno);
extern void SimpleLruLockAllPartitions(SlruCtl ctl, LWLockMode mode);
extern void SimpleLruUnLockAllPartitions(SlruCtl ctl);
extern LWLock *SimpleLruGetPartitionLock(SlruCtl ctl, int pageno);
+extern bool check_slru_buffers(const char *name, int *newval);
#endif /* SLRU_H */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 8597e430de..7dd96a2059 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -128,6 +128,17 @@ extern bool check_ssl(bool *newval, void **extra, GucSource source);
extern bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
extern bool check_synchronous_standby_names(char **newval, void **extra,
GucSource source);
+extern bool check_multixact_offsets_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_multixact_members_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_subtrans_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
+extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
+extern bool check_commit_ts_buffers(int *newval, void **extra,
+ GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
extern void assign_syslog_facility(int newval, void *extra);
--
2.39.2 (Apple Git-143)
v4-0001-Make-all-SLRU-buffer-sizes-configurable.patch
From acfdf8c7bc64026d51c7f187080294843e805617 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 14:45:00 +0530
Subject: [PATCH v4 1/5] Make all SLRU buffer sizes configurable.
Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.
Remove the limits on xact_buffers and commit_ts_buffers. The default
sizes for those caches are ~0.2% and ~0.1% of shared_buffers, as before,
but now there is no cap at 128 and 16 buffers respectively (unless
track_commit_timestamp is disabled, in which case we might as well keep
it tiny). Sizes much larger than the old limits have been
shown to be useful on modern systems.
Patch by Andrey M. Borodin with some bug fixes by Dilip Kumar
Reviewed by Anastasia Lubennikova, Tomas Vondra, Alexander Korotkov,
Gilles Darold, Thomas Munro and Dilip Kumar
---
doc/src/sgml/config.sgml | 135 ++++++++++++++++++
src/backend/access/transam/clog.c | 23 ++-
src/backend/access/transam/commit_ts.c | 7 +-
src/backend/access/transam/multixact.c | 8 +-
src/backend/access/transam/subtrans.c | 5 +-
src/backend/commands/async.c | 8 +-
src/backend/commands/variable.c | 19 +++
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/utils/init/globals.c | 8 ++
src/backend/utils/misc/guc_tables.c | 77 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 9 ++
src/include/access/clog.h | 10 ++
src/include/access/commit_ts.h | 1 -
src/include/access/multixact.h | 4 -
src/include/access/slru.h | 5 +
src/include/access/subtrans.h | 3 -
src/include/commands/async.h | 5 -
src/include/miscadmin.h | 7 +
src/include/storage/predicate.h | 4 -
src/include/utils/guc_hooks.h | 2 +
20 files changed, 299 insertions(+), 45 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 985cabfc0b..eeb21efdd4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,141 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-offsets-buffers" xreflabel="multixact_offsets_buffers">
+ <term><varname>multixact_offsets_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/offsets</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>8</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-buffers" xreflabel="multixact_members_buffers">
+ <term><varname>multixact_members_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/members</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>16</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-subtrans-buffers" xreflabel="subtrans_buffers">
+ <term><varname>subtrans_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>subtrans_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_subtrans</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>8</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-notify-buffers" xreflabel="notify_buffers">
+ <term><varname>notify_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_notify</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>8</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-serial-buffers" xreflabel="serial_buffers">
+ <term><varname>serial_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>serial_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_serial</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>16</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-xact-buffers" xreflabel="xact_buffers">
+ <term><varname>xact_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>xact_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_xact</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 512, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-commit-ts-buffers" xreflabel="commit_ts_buffers">
+ <term><varname>commit_ts_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>commit_ts_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents of
+ <literal>pg_commit_ts</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 1024, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..7979bbd00f 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -58,8 +58,8 @@
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
-#define CLOG_XACTS_PER_BYTE 4
-#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+StaticAssertDecl((CLOG_BITS_PER_XACT * CLOG_XACTS_PER_BYTE) == BITS_PER_BYTE,
+ "CLOG_BITS_PER_XACT and CLOG_XACTS_PER_BYTE are inconsistent");
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
@@ -663,23 +663,16 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
+ * By default, we'll use 2MB for every 1GB of shared buffers, up to the
+ * theoretical maximum useful value, but always at least 16 buffers.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(128, Max(4, NBuffers / 512));
+ /* Use configured value if provided. */
+ if (xact_buffers > 0)
+ return Max(16, xact_buffers);
+ return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(16, NBuffers / 512));
}
/*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..47a1c9f0e5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -493,11 +493,16 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
* We use a very similar logic as for the number of CLOG buffers (except we
* scale up twice as fast with shared buffers, and the maximum is twice as
* high); see comments in CLOGShmemBuffers.
+ * By default, we'll use 1MB for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
}
/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57ed34c0a8..62709fcd07 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1834,8 +1834,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_buffers, 0));
return size;
}
@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_members_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..0dd48f40f3 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
@@ -184,14 +185,14 @@ SubTransGetTopmostTransaction(TransactionId xid)
Size
SUBTRANSShmemSize(void)
{
- return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
+ return SimpleLruShmemSize(subtrans_buffers, 0);
}
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
- SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
+ SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
SubtransSLRULock, "pg_subtrans",
LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 38ddae08b8..4bdbbe5cc0 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -117,7 +117,7 @@
* frontend during startup.) The above design guarantees that notifies from
* other backends will never be missed by ignoring self-notifies.
*
- * The amount of shared memory used for notify management (NUM_NOTIFY_BUFFERS)
+ * The amount of shared memory used for notify management (notify_buffers)
* can be varied without affecting anything but performance. The maximum
* amount of notification data that can be queued at one time is determined
* by slru.c's wraparound limit; see QUEUE_MAX_PAGE below.
@@ -235,7 +235,7 @@ typedef struct QueuePosition
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
- * should likely be less than NUM_NOTIFY_BUFFERS, to ensure that backends
+ * should likely be less than notify_buffers, to ensure that backends
* catch up before the pages they'll need to read fall out of SLRU cache.
*/
#define QUEUE_CLEANUP_DELAY 4
@@ -521,7 +521,7 @@ AsyncShmemSize(void)
size = mul_size(MaxBackends + 1, sizeof(QueueBackendStatus));
size = add_size(size, offsetof(AsyncQueueControl, backend));
- size = add_size(size, SimpleLruShmemSize(NUM_NOTIFY_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
return size;
}
@@ -569,7 +569,7 @@ AsyncShmemInit(void)
* Set up SLRU management of the pg_notify data.
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
- SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
+ SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
SYNC_HANDLER_NONE);
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index a88cf5f118..ee25aa0656 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -18,6 +18,8 @@
#include <ctype.h>
+#include "access/clog.h"
+#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/xact.h"
@@ -400,6 +402,23 @@ show_timezone(void)
return "unknown";
}
+const char *
+show_xact_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CLOGShmemBuffers());
+ return nbuf;
+}
+
+const char *
+show_commit_ts_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CommitTsShmemBuffers());
+ return nbuf;
+}
/*
* LOG_TIMEZONE
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a794546db3..18ea18316d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,7 +808,7 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
+ serial_buffers, 0, SerialSLRULock, "pg_serial",
LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
@@ -1347,7 +1347,7 @@ PredicateLockShmemSize(void)
/* Shared memory structures for SLRU tracking of old committed xids. */
size = add_size(size, sizeof(SerialControlData));
- size = add_size(size, SimpleLruShmemSize(NUM_SERIAL_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(serial_buffers, 0));
return size;
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 60bc1217fb..96d480325b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -156,3 +156,11 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int multixact_offsets_buffers = 64;
+int multixact_members_buffers = 64;
+int subtrans_buffers = 64;
+int notify_buffers = 64;
+int serial_buffers = 64;
+int xact_buffers = 64;
+int commit_ts_buffers = 64;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..c82635943b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
#include "access/xlog_internal.h"
@@ -2287,6 +2288,82 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact offset SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact member SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"subtrans_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the sub-transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &subtrans_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+ {
+ {"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the NOTIFY message SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &notify_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"serial_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the serializable transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &serial_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"xact_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the transaction status SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &xact_buffers,
+ 64, 0, CLOG_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_xact_buffers
+ },
+
+ {
+ {"commit_ts_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &commit_ts_buffers,
+ 64, 0, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_commit_ts_buffers
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..c21d6468ed 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -50,6 +50,15 @@
#external_pid_file = '' # write an extra PID file
# (change requires restart)
+# - SLRU Buffers (change requires restart) -
+
+#xact_buffers = 0 # memory for pg_xact (0 = auto)
+#subtrans_buffers = 32 # memory for pg_subtrans
+#multixact_offsets_buffers = 8 # memory for pg_multixact/offsets
+#multixact_members_buffers = 16 # memory for pg_multixact/members
+#notify_buffers = 8 # memory for pg_notify
+#serial_buffers = 16 # memory for pg_serial
+#commit_ts_buffers = 0 # memory for pg_commit_ts (0 = auto)
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index d99444f073..a9cd65db36 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -15,6 +15,16 @@
#include "storage/sync.h"
#include "lib/stringinfo.h"
+/*
+ * Don't allow xact_buffers to be set higher than could possibly be useful,
+ * or higher than slru.c would allow.
+ */
+#define CLOG_XACTS_PER_BYTE 4
+#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+#define CLOG_MAX_ALLOWED_BUFFERS \
+ Min(SLRU_MAX_ALLOWED_BUFFERS, \
+ (((MaxTransactionId / 2) + (CLOG_XACTS_PER_PAGE - 1)) / CLOG_XACTS_PER_PAGE))
+
/*
* Possible transaction statuses --- note that all-zeroes is the initial
* state.
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
#include "replication/origin.h"
#include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 0be1355892..18d7ba4ca9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -29,10 +29,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 552cc19e68..c0d37e3eb3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
/*
* Define SLRU segment size. A page is the same BLCKSZ as is used everywhere
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..147dc4acc3 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -11,9 +11,6 @@
#ifndef SUBTRANS_H
#define SUBTRANS_H
-/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
-
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
extern TransactionId SubTransGetTopmostTransaction(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b3e6815ee4 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -15,11 +15,6 @@
#include <signal.h>
-/*
- * The number of SLRU page buffers we use for the notification queue.
- */
-#define NUM_NOTIFY_BUFFERS 8
-
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..e2473f41de 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,13 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_buffers;
+extern PGDLLIMPORT int multixact_members_buffers;
+extern PGDLLIMPORT int subtrans_buffers;
+extern PGDLLIMPORT int notify_buffers;
+extern PGDLLIMPORT int serial_buffers;
+extern PGDLLIMPORT int xact_buffers;
+extern PGDLLIMPORT int commit_ts_buffers;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..7b68c8f1c7 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -26,10 +26,6 @@ extern PGDLLIMPORT int max_predicate_locks_per_xact;
extern PGDLLIMPORT int max_predicate_locks_per_relation;
extern PGDLLIMPORT int max_predicate_locks_per_page;
-
-/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
-
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
* in a parallel query.
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2a191830a8..8597e430de 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -161,4 +161,6 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern bool check_wal_segment_size(int *newval, void **extra, GucSource source);
extern void assign_wal_sync_method(int new_wal_sync_method, void *extra);
+extern const char *show_xact_buffers(void);
+extern const char *show_commit_ts_buffers(void);
#endif /* GUC_HOOKS_H */
--
2.39.2 (Apple Git-143)
v4-0003-Partition-wise-slru-locks.patch
From ee74be845bbaff6d4db6add978f016292d90de10 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Nov 2023 14:02:37 +0530
Subject: [PATCH v4 3/5] Partition wise slru locks
The previous patch implemented a buffer mapping hash table.
This patch optimizes it further by partitioning the hash table
and introducing partition-wise locks in place of the single
centralized lock, which reduces contention on the SLRU control
lock. The victim buffer search is also limited to the slots
covered by a single partition.
Dilip Kumar with design input from Robert Haas
---
src/backend/access/transam/clog.c | 115 ++++++----
src/backend/access/transam/commit_ts.c | 43 ++--
src/backend/access/transam/multixact.c | 177 ++++++++++-----
src/backend/access/transam/slru.c | 261 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 59 +++--
src/backend/commands/async.c | 46 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 35 +--
src/include/access/slru.h | 52 +++--
src/include/storage/lwlock.h | 7 +
src/test/modules/test_slru/test_slru.c | 32 +--
12 files changed, 601 insertions(+), 254 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 7979bbd00f..ab453cd171 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -274,14 +274,19 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the SLRU partition lock w.r.t. the page we are going to access. */
+ lock = SimpleLruGetPartitionLock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on the SLRU lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
+ * updates for multiple backends so that the number of times the SLRU lock
* needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
@@ -300,17 +305,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire the SLRU lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -323,10 +328,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -345,7 +350,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruGetPartitionLock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -396,14 +402,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire the SLRU partition lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * the lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention when many
+ * processes are trying to commit at once, since the lock need not be
+ * repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -417,6 +422,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -497,13 +504,10 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
- * group XID status update, saving a pointer to the head of the list.
- * Trying to pop elements one at a time could lead to an ABA problem.
+ * We are the leader, so clear the list of processes waiting for group XID
+ * status update, saving a pointer to the head of the list. Trying to pop
+ * elements one at a time could lead to an ABA problem.
*/
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
INVALID_PGPROCNO);
@@ -511,10 +515,39 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /* Acquire the SLRU partition lock w.r.t. the first page in the group. */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruGetPartitionLock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * Although we try our best to keep all members of a group on the same
+ * page, there are cases where the pages can differ; for details see the
+ * comment in the while loop above, where this process is added for the
+ * group update. So if the page we are about to access is not in the
+ * same SLRU partition as the page we updated last, we need to release
+ * the lock on the previous partition and acquire the lock on the
+ * partition covering the page we are going to update now.
+ */
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruGetPartitionLock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -534,7 +567,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -563,10 +597,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with the slot-specific SLRU partition lock held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -655,7 +690,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(XactCtl, pageno));
return status;
}
@@ -689,8 +724,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
- SYNC_HANDLER_CLOG);
+ "pg_xact", LWTRANCHE_XACT_BUFFER,
+ LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -704,8 +739,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -714,7 +750,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -749,14 +785,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -767,8 +799,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruGetPartitionLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -800,7 +833,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -832,6 +865,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -842,13 +876,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruGetPartitionLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -986,16 +1021,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 47a1c9f0e5..58314e3885 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -218,8 +218,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruGetPartitionLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -229,13 +230,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the slot-specific SLRU partition lock held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -336,7 +337,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -526,9 +527,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -684,9 +684,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -713,12 +711,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -767,9 +766,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruLockAllPartitions(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruUnLockAllPartitions(CommitTsCtl);
}
/*
@@ -801,6 +800,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -821,12 +821,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -980,16 +982,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruGetPartitionLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -1001,7 +1005,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 62709fcd07..aa4f11fd3b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -192,10 +192,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the SLRU partition locks of MultiXactOffset
+ * and MultiXactMember to guard accesses to the two sets of SLRU buffers. For
+ * concurrency's sake, we avoid holding more than one of these locks at a
+ * time.)
*/
typedef struct MultiXactStateData
{
@@ -870,12 +870,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -891,10 +894,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -916,6 +917,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU partition; if so, release
+ * the old partition's lock and acquire the lock on the new
+ * partition.
+ */
+ lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -936,7 +951,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1239,6 +1255,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1342,11 +1360,23 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If the page is in a different SLRU partition, release the lock on the
+ * previous partition (if we are holding one) and acquire the lock on the
+ * new partition.
+ */
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1379,7 +1409,22 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * The SLRU pageno has changed, so check whether this page falls
+ * into a different SLRU partition than the one whose lock we are
+ * already holding; if so, release the lock on the old partition
+ * and acquire the lock on the new one.
+ */
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1388,7 +1433,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1397,13 +1443,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1419,6 +1463,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU partition; if so, release
+ * the old partition's lock and acquire the lock on the new
+ * partition.
+ */
+ lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1442,7 +1500,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1852,14 +1911,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", multixact_offsets_buffers, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", multixact_members_buffers, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
@@ -1894,8 +1953,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1904,9 +1965,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1915,7 +1977,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1975,10 +2037,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1993,7 +2057,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2015,13 +2079,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2046,13 +2112,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2067,7 +2133,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2075,18 +2143,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2098,7 +2165,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2113,10 +2182,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2404,6 +2472,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2414,13 +2483,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2453,15 +2523,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2759,7 +2831,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3241,31 +3313,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3331,7 +3405,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index ac23076def..ab7cd276ce 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -71,6 +71,7 @@
* to SimpleLruWriteAll(). This data structure remembers which files are open.
*/
#define MAX_WRITEALL_BUFFERS 16
+#define SLRU_NUM_PARTITIONS 8
typedef struct SlruWriteAllData
{
@@ -102,34 +103,6 @@ typedef struct SlruMappingTableEntry
(a).segno = (xx_segno) \
)
-/*
- * Macro to mark a buffer slot "most recently used". Note multiple evaluation
- * of arguments!
- *
- * The reason for the if-test is that there are often many consecutive
- * accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
- * pages' counts will "wrap around" and make them appear recently used.
- *
- * We allow this code to be executed concurrently by multiple processes within
- * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
- * this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
- * page_lru_count entries to be "reset" to lower values than they should have,
- * in case a process is delayed while it executes this macro. With care in
- * SlruSelectLRUPage(), this does little harm, and in any case the absolute
- * worst possible consequence is a nonoptimal choice of page to evict. The
- * gain from allowing concurrent reads of SLRU pages seems worth it.
- */
-#define SlruRecentlyUsed(shared, slotno) \
- do { \
- int new_lru_count = (shared)->cur_lru_count; \
- if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
- (shared)->page_lru_count[slotno] = new_lru_count; \
- } \
- } while (0)
-
/* Saved info for SlruReportIOError */
typedef enum
{
@@ -160,6 +133,9 @@ static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
static void SlruMappingAdd(SlruCtl ctl, int pageno, int slotno);
static void SlruMappingRemove(SlruCtl ctl, int pageno);
static int SlruMappingFind(SlruCtl ctl, int pageno);
+static inline int SlruMappingPartNo(SlruCtl ctl, int pageno);
+static inline void SlruRecentlyUsed(SlruShared shared, int slotno,
+ int partsize);
/*
* Helper function of SimpleLruShmemSize to compute the SlruSharedData size.
@@ -177,6 +153,8 @@ SimpleLruStructSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+ sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(int)); /* part_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -207,7 +185,7 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int buffer_tranche_id, int part_tranche_id,
SyncRequestHandler sync_handler)
{
char mapping_table_name[SHMEM_INDEX_KEYSIZE];
@@ -226,18 +204,15 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int partno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -258,6 +233,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->part_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded));
+ shared->part_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(int));
if (nlsns > 0)
{
@@ -269,7 +248,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
for (slotno = 0; slotno < nslots; slotno++)
{
LWLockInitialize(&shared->buffer_locks[slotno].lock,
- tranche_id);
+ buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -277,6 +256,13 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
shared->page_lru_count[slotno] = 0;
ptr += BLCKSZ;
}
+ /* Initialize partition locks for each buffer partition. */
+ for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
+ {
+ LWLockInitialize(&shared->part_locks[partno].lock,
+ part_tranche_id);
+ shared->part_cur_lru_count[partno] = 0;
+ }
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
@@ -288,10 +274,12 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
memset(&mapping_table_info, 0, sizeof(mapping_table_info));
mapping_table_info.keysize = sizeof(int);
mapping_table_info.entrysize = sizeof(SlruMappingTableEntry);
+ mapping_table_info.num_partitions = SLRU_NUM_PARTITIONS;
snprintf(mapping_table_name, sizeof(mapping_table_name),
"%s Lookup Table", name);
mapping_table = ShmemInitHash(mapping_table_name, nslots, nslots,
- &mapping_table_info, HASH_ELEM | HASH_BLOBS);
+ &mapping_table_info,
+ HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
/*
* Initialize the unshared control struct, including directory path. We
@@ -300,6 +288,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
ctl->shared = shared;
ctl->mapping_table = mapping_table;
ctl->sync_handler = sync_handler;
+ ctl->part_size = shared->num_slots / SLRU_NUM_PARTITIONS;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -331,7 +320,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->part_size);
/* Set the buffer to zeroes */
MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
@@ -340,7 +329,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -379,12 +368,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int partno = slotno / ctl->part_size;
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[partno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -442,6 +432,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int partno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -464,7 +455,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
continue;
}
/* Otherwise, it's ready to use */
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->part_size);
/* update the stats counter of pages found in the SLRU */
pgstat_count_slru_page_hit(shared->slru_stats_idx);
@@ -487,9 +478,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ partno = slotno / ctl->part_size;
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[partno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -498,7 +490,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -518,7 +510,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
if (!ok)
SlruReportIOError(ctl, pageno, xid);
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->part_size);
/* update the stats counter of pages not found in SLRU */
pgstat_count_slru_page_read(shared->slru_stats_idx);
@@ -546,9 +538,13 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int partno;
+
+ /* Determine partition number for the page. */
+ partno = SlruMappingPartNo(ctl, pageno);
- /* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ /* Try to find the page while holding only shared partition lock */
+ LWLockAcquire(&shared->part_locks[partno].lock, LW_SHARED);
/* See if page is already in a buffer */
slotno = SlruMappingFind(ctl, pageno);
@@ -559,7 +555,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
Assert(shared->page_number[slotno] == pageno);
/* See comments for SlruRecentlyUsed macro */
- SlruRecentlyUsed(shared, slotno);
+ SlruRecentlyUsed(shared, slotno, ctl->part_size);
/* update the stats counter of pages found in the SLRU */
pgstat_count_slru_page_hit(shared->slru_stats_idx);
@@ -568,8 +564,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->part_locks[partno].lock);
+ LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -591,6 +587,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int partno = slotno / ctl->part_size;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -619,7 +616,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[partno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -634,7 +631,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1078,6 +1075,9 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int bestinvalidslot = 0; /* keep compiler quiet */
int best_invalid_delta = -1;
int best_invalid_page_number = 0; /* keep compiler quiet */
+ int partno;
+ int partstart;
+ int partend;
/* See if page already has a buffer assigned */
slotno = SlruMappingFind(ctl, pageno);
@@ -1088,6 +1088,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
return slotno;
}
+ /*
+ * Compute the starting and (exclusive) ending slotno of this page's
+ * partition, based on the partition number.
+ */
+ partno = SlruMappingPartNo(ctl, pageno);
+ partstart = partno * ctl->part_size;
+ partend = partstart + ctl->part_size;
+
/*
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
@@ -1115,8 +1123,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ cur_count = (shared->part_cur_lru_count[partno])++;
+ for (slotno = partstart; slotno < partend; slotno++)
{
int this_delta;
int this_page_number;
@@ -1137,7 +1145,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1211,6 +1219,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int lastpartno = 0;
bool ok;
/* update the stats counter of flushes */
@@ -1221,10 +1230,19 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->part_locks[0].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curpartno = slotno / ctl->part_size;
+
+ if (curpartno != lastpartno)
+ {
+ LWLockRelease(&shared->part_locks[lastpartno].lock);
+ LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
+ lastpartno = curpartno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1238,7 +1256,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[lastpartno].lock);
/*
* Now close any files that were open
@@ -1278,6 +1296,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevpartno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1288,25 +1307,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevpartno = 0;
+ LWLockAcquire(&shared->part_locks[prevpartno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curpartno = slotno / ctl->part_size;
+
+ /*
+ * If curpartno differs from prevpartno, release the lock on the
+ * previous partition and acquire the lock on the current one.
+ */
+ if (curpartno != prevpartno)
+ {
+ LWLockRelease(&shared->part_locks[prevpartno].lock);
+ LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
+ prevpartno = curpartno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1337,10 +1369,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->part_locks[prevpartno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[prevpartno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1381,15 +1415,31 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevpartno = 0;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->part_locks[prevpartno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curpartno;
+
+ curpartno = slotno / ctl->part_size;
+ /*
+ * If curpartno differs from prevpartno, release the lock on the
+ * previous partition and acquire the lock on the current one.
+ */
+ if (curpartno != prevpartno)
+ {
+ LWLockRelease(&shared->part_locks[prevpartno].lock);
+ LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
+ prevpartno = curpartno;
+ }
+
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1424,7 +1474,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->part_locks[prevpartno].lock);
}
/*
@@ -1636,6 +1686,38 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+/*
+ * Function to mark a buffer slot "most recently used".
+ *
+ * The reason for the if-test is that there are often many consecutive
+ * accesses to the same page (particularly the latest page). By suppressing
+ * useless increments of part_cur_lru_count, we reduce the probability that old
+ * pages' counts will "wrap around" and make them appear recently used.
+ *
+ * We allow this code to be executed concurrently by multiple processes within
+ * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
+ * this should not cause any completely-bogus values to enter the computation.
+ * However, it is possible for either part_cur_lru_count or individual
+ * page_lru_count entries to be "reset" to lower values than they should have,
+ * in case a process is delayed while it executes this function. With care in
+ * SlruSelectLRUPage(), this does little harm, and in any case the absolute
+ * worst possible consequence is a nonoptimal choice of page to evict. The
+ * gain from allowing concurrent reads of SLRU pages seems worth it.
+ */
+static inline void
+SlruRecentlyUsed(SlruShared shared, int slotno, int partsize)
+{
+ int slrupartno = slotno / partsize;
+ int new_lru_count = shared->part_cur_lru_count[slrupartno];
+
+ if (new_lru_count != shared->page_lru_count[slotno])
+ {
+ shared->part_cur_lru_count[slrupartno] = ++new_lru_count;
+ shared->page_lru_count[slotno] = new_lru_count;
+ }
+}
+
/*
* Individual SLRUs (clog, ...) have to provide a sync.c handler function so
* that they can provide the correct "SlruCtl" (otherwise we don't know how to
@@ -1709,3 +1791,56 @@ SlruMappingRemove(SlruCtl ctl, int pageno)
Assert(found);
}
+
+/*
+ * The SLRU buffer mapping table is partitioned to reduce contention. To
+ * determine which partition a given pageno belongs to, compute the pageno's
+ * hash code with get_hash_value() and take it modulo SLRU_NUM_PARTITIONS.
+ */
+static inline int
+SlruMappingPartNo(SlruCtl ctl, int pageno)
+{
+ uint32 hashcode = get_hash_value(ctl->mapping_table, (void *) &pageno);
+
+ return hashcode % SLRU_NUM_PARTITIONS;
+}
+
+/*
+ * Get the SLRU partition lock for the given SlruCtl and pageno.
+ *
+ * This lock must be acquired in order to access the SLRU buffer slots in
+ * the respective partition. For details, see the comments in SlruSharedData.
+ */
+LWLock *
+SimpleLruGetPartitionLock(SlruCtl ctl, int pageno)
+{
+ int partno = SlruMappingPartNo(ctl, pageno);
+
+ return &(ctl->shared->part_locks[partno].lock);
+}
+
+/*
+ * Acquire all partition locks of the given SlruCtl.
+ */
+void
+SimpleLruLockAllPartitions(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int partno;
+
+ for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
+ LWLockAcquire(&shared->part_locks[partno].lock, mode);
+}
+
+/*
+ * Release all partition locks of the given SlruCtl.
+ */
+void
+SimpleLruUnLockAllPartitions(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int partno;
+
+ for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
+ LWLockRelease(&shared->part_locks[partno].lock);
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0dd48f40f3..e4da6e28ae 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -77,12 +77,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -100,7 +102,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -130,7 +132,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(SubTransCtl, pageno));
return parent;
}
@@ -193,8 +195,9 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU,
+ SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -212,8 +215,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -222,7 +226,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -252,6 +256,8 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
FullTransactionId nextXid;
int startPage;
int endPage;
+ LWLock *prevlock;
+ LWLock *lock;
/*
* Since we don't expect pg_subtrans to be valid across crashes, we
@@ -259,23 +265,48 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
-
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+ prevlock = SimpleLruGetPartitionLock(SubTransCtl, startPage);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
while (startPage != endPage)
{
+ lock = SimpleLruGetPartitionLock(SubTransCtl, startPage);
+
+ /*
+ * If this page falls into a different partition than the previous
+ * one, release the lock on the old partition and acquire the lock
+ * on the new partition.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
(void) ZeroSUBTRANSPage(startPage);
startPage++;
/* must account for wraparound */
if (startPage > TransactionIdToPage(MaxTransactionId))
startPage = 0;
}
- (void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ lock = SimpleLruGetPartitionLock(SubTransCtl, startPage);
+
+ /*
+ * If the last page falls into a different partition, release the lock
+ * on the old partition and acquire the lock on the new partition.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ (void) ZeroSUBTRANSPage(startPage);
+ LWLockRelease(lock);
}
/*
@@ -309,6 +340,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -320,12 +352,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetPartitionLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bdbbe5cc0..81fdca410b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -267,9 +267,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * The SLRU buffer pool is divided into partitions, and the partition-wise
+ * SLRU locks are used as the control locks for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU partition lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -570,7 +571,7 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER, LWTRANCHE_NOTIFY_SLRU,
SYNC_HANDLER_NONE);
if (!found)
@@ -1402,7 +1403,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * the page-specific SLRU partition lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1412,9 +1413,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *prevlock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1438,6 +1437,14 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ prevlock = SimpleLruGetPartitionLock(NotifyCtl, pageno);
+
+ /*
+ * We hold both NotifyQueueLock and the SLRU partition lock during this
+ * operation.
+ */
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1483,6 +1490,8 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Advance queue_head appropriately, and detect if page is full */
if (asyncQueueAdvance(&(queue_head), qe.length))
{
+ LWLock *lock;
+
/*
* Page is full, so we're done here, but first fill the next page
* with zeroes. The reason to do this is to ensure that slru.c's
@@ -1491,6 +1500,15 @@ asyncQueueAddEntries(ListCell *nextNotify)
* asyncQueueIsFull() ensured that there is room to create this
* page without overrunning the queue.
*/
+ pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruGetPartitionLock(NotifyCtl, pageno);
+ if (lock != prevlock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruZeroPage(NotifyCtl, QUEUE_POS_PAGE(queue_head));
/*
@@ -1509,7 +1527,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(prevlock);
return nextNotify;
}
@@ -1988,9 +2006,9 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
- * and possibly transmitting them to our frontend. Copy only the
- * part of the page we will actually inspect.
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
InvalidTransactionId);
@@ -2010,7 +2028,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2051,7 +2069,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU partition lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2204,7 +2222,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU partition locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+# 26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+# 38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 18ea18316d..6b7c1aa00e 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,8 +808,9 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- serial_buffers, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ serial_buffers, 0, "pg_serial",
+ LWTRANCHE_SERIAL_BUFFER, LWTRANCHE_SERIAL_SLRU,
+ SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -846,12 +847,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruGetPartitionLock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -901,7 +904,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -919,10 +922,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -934,13 +937,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
- * but will return with that lock held, which must then be released.
+ * The following function must be called without holding the SLRU partition
+ * lock, but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruGetPartitionLock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -953,7 +956,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -965,7 +968,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -983,7 +986,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -992,7 +995,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1006,12 +1009,12 @@ CheckPointPredicate(void)
{
int truncateCutoffPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1071,7 +1074,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, truncateCutoffPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 9cd0899f1d..e6c54d5519 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -58,8 +58,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -75,33 +73,47 @@ typedef struct SlruSharedData
LWLockPadded *buffer_locks;
/*
- * Optional array of WAL flush LSNs associated with entries in the SLRU
- * pages. If not zero/NULL, we must flush WAL before writing pages (true
- * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
- * has lsn_groups_per_page entries per buffer slot, each containing the
- * highest LSN known for a contiguous group of SLRU entries on that slot's
- * page.
+ * Locks to protect in-memory buffer slot access, one per SLRU bank. The
+ * buffer_locks protect the I/O on each buffer slot, whereas these locks
+ * protect the in-memory operations on the buffers within one SLRU bank.
*/
- XLogRecPtr *group_lsn;
- int lsn_groups_per_page;
+ LWLockPadded *part_locks;
/*----------
+ * Instead of a global counter we maintain a partition-wise LRU counter,
+ * because
+ * a) victim buffer selection is done at the partition level, so there is
+ * no point in having a global counter, and b) manipulating a global
+ * counter causes frequent CPU cache-line invalidation, which hurts
+ * performance.
+ *
* We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
+ * page_lru_count[slotno] = ++part_cur_lru_count[partno];
* The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
+ * part_cur_lru_count[partno] - page_lru_count[slotno]
* The counts will eventually wrap around, but this calculation still
* works as long as no page's age exceeds INT_MAX counts.
*----------
*/
- int cur_lru_count;
+ int *part_cur_lru_count;
+
+ /*
+ * Optional array of WAL flush LSNs associated with entries in the SLRU
+ * pages. If not zero/NULL, we must flush WAL before writing pages (true
+ * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
+ * has lsn_groups_per_page entries per buffer slot, each containing the
+ * highest LSN known for a contiguous group of SLRU entries on that slot's
+ * page.
+ */
+ XLogRecPtr *group_lsn;
+ int lsn_groups_per_page;
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -143,6 +155,9 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /* Size of one slru buffer pool partition */
+ int part_size;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
@@ -150,8 +165,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int buffer_tranche_id,
+ int bank_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -179,5 +194,7 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern LWLock *SimpleLruGetPartitionLock(SlruCtl ctl, int pageno);
+extern void SimpleLruLockAllPartitions(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruUnLockAllPartitions(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b038e599c0..87cb812b84 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,13 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..b9178d0ee2 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -40,10 +40,6 @@ PG_FUNCTION_INFO_V1(test_slru_delete_all);
/* Number of SLRU page slots */
#define NUM_TEST_BUFFERS 16
-/* SLRU control lock */
-LWLock TestSLRULock;
-#define TestSLRULock (&TestSLRULock)
-
static SlruCtlData TestSlruCtlData;
#define TestSlruCtl (&TestSlruCtlData)
@@ -63,9 +59,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +76,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +95,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +113,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetPartitionLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +131,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruGetPartitionLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +214,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +228,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
+ LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
v4-0002-Add-a-buffer-mapping-table-for-SLRUs.patch
From cb46346ee896b4ea7778d0e0562e1a250e771bb6 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 31 Oct 2023 10:26:45 +0530
Subject: [PATCH v4 2/5] Add a buffer mapping table for SLRUs.
Instead of doing a linear search for the buffer holding a given page
number, use a hash table. This will allow us to increase the size of
these caches.
Patch by: Thomas Munro, with adjustments by Dilip Kumar
Reviewed-by: Andrey M. Borodin and Dilip Kumar
---
src/backend/access/transam/slru.c | 140 +++++++++++++++++++++++++-----
src/include/access/slru.h | 4 +
src/tools/pgindent/typedefs.list | 1 +
3 files changed, 123 insertions(+), 22 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9ed24e1185..ac23076def 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "utils/hsearch.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -80,6 +81,15 @@ typedef struct SlruWriteAllData
typedef struct SlruWriteAllData *SlruWriteAll;
+/*
+ * Hash table entry mapping a pageno to its slotno in the SLRU buffer pool.
+ */
+typedef struct SlruMappingTableEntry
+{
+ int pageno;
+ int slotno;
+} SlruMappingTableEntry;
+
/*
* Populate a file tag describing a segment file. We only use the segment
* number, since we can derive everything else we need by having separate
@@ -147,13 +157,15 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static void SlruMappingAdd(SlruCtl ctl, int pageno, int slotno);
+static void SlruMappingRemove(SlruCtl ctl, int pageno);
+static int SlruMappingFind(SlruCtl ctl, int pageno);
/*
- * Initialization of shared memory
+ * Helper for SimpleLruShmemSize() to compute the SlruSharedData size.
*/
-
-Size
-SimpleLruShmemSize(int nslots, int nlsns)
+static Size
+SimpleLruStructSize(int nslots, int nlsns)
{
Size sz;
@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
+/*
+ * Compute shared memory space needed, including the buffer mapping table.
+ */
+Size
+SimpleLruShmemSize(int nslots, int nlsns)
+{
+ return SimpleLruStructSize(nslots, nlsns) +
+ hash_estimate_size(nslots, sizeof(SlruMappingTableEntry));
+}
+
/*
* Initialize, or attach to, a simple LRU cache in shared memory.
*
@@ -189,11 +210,14 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
LWLock *ctllock, const char *subdir, int tranche_id,
SyncRequestHandler sync_handler)
{
+ char mapping_table_name[SHMEM_INDEX_KEYSIZE];
+ HASHCTL mapping_table_info;
+ HTAB *mapping_table;
SlruShared shared;
bool found;
shared = (SlruShared) ShmemInitStruct(name,
- SimpleLruShmemSize(nslots, nlsns),
+ SimpleLruStructSize(nslots, nlsns),
&found);
if (!IsUnderPostmaster)
@@ -260,11 +284,21 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
else
Assert(found);
+ /* Create or find the buffer mapping table. */
+ memset(&mapping_table_info, 0, sizeof(mapping_table_info));
+ mapping_table_info.keysize = sizeof(int);
+ mapping_table_info.entrysize = sizeof(SlruMappingTableEntry);
+ snprintf(mapping_table_name, sizeof(mapping_table_name),
+ "%s Lookup Table", name);
+ mapping_table = ShmemInitHash(mapping_table_name, nslots, nslots,
+ &mapping_table_info, HASH_ELEM | HASH_BLOBS);
+
/*
* Initialize the unshared control struct, including directory path. We
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
+ ctl->mapping_table = mapping_table;
ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -291,6 +325,9 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
shared->page_number[slotno] == pageno);
/* Mark the slot as containing this page */
+ if (shared->page_status[slotno] != SLRU_PAGE_EMPTY)
+ SlruMappingRemove(ctl, shared->page_number[slotno]);
+ SlruMappingAdd(ctl, pageno, slotno);
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
@@ -364,7 +401,10 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
/* indeed, the I/O must have failed */
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
+ {
+ SlruMappingRemove(ctl, shared->page_number[slotno]);
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+ }
else /* write_in_progress */
{
shared->page_status[slotno] = SLRU_PAGE_VALID;
@@ -438,6 +478,9 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
!shared->page_dirty[slotno]));
/* Mark the slot read-busy */
+ if (shared->page_status[slotno] != SLRU_PAGE_EMPTY)
+ SlruMappingRemove(ctl, shared->page_number[slotno]);
+ SlruMappingAdd(ctl, pageno, slotno);
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
shared->page_dirty[slotno] = false;
@@ -461,7 +504,13 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
!shared->page_dirty[slotno]);
- shared->page_status[slotno] = ok ? SLRU_PAGE_VALID : SLRU_PAGE_EMPTY;
+ if (ok)
+ shared->page_status[slotno] = SLRU_PAGE_VALID;
+ else
+ {
+ SlruMappingRemove(ctl, pageno);
+ shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+ }
LWLockRelease(&shared->buffer_locks[slotno].lock);
@@ -502,20 +551,20 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ slotno = SlruMappingFind(ctl, pageno);
+ if (slotno >= 0 &&
+ shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
{
- if (shared->page_number[slotno] == pageno &&
- shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
- shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
- {
- /* See comments for SlruRecentlyUsed macro */
- SlruRecentlyUsed(shared, slotno);
+ Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
+ Assert(shared->page_number[slotno] == pageno);
- /* update the stats counter of pages found in the SLRU */
- pgstat_count_slru_page_hit(shared->slru_stats_idx);
+ /* See comments for SlruRecentlyUsed macro */
+ SlruRecentlyUsed(shared, slotno);
- return slotno;
- }
+ /* update the stats counter of pages found in the SLRU */
+ pgstat_count_slru_page_hit(shared->slru_stats_idx);
+
+ return slotno;
}
/* No luck, so switch to normal exclusive lock and do regular read */
@@ -1031,11 +1080,12 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ slotno = SlruMappingFind(ctl, pageno);
+ if (slotno >= 0)
{
- if (shared->page_number[slotno] == pageno &&
- shared->page_status[slotno] != SLRU_PAGE_EMPTY)
- return slotno;
+ Assert(shared->page_number[slotno] == pageno);
+ Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
+ return slotno;
}
/*
@@ -1268,6 +1318,7 @@ restart:
if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno])
{
+ SlruMappingRemove(ctl, shared->page_number[slotno]);
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
continue;
}
@@ -1350,6 +1401,7 @@ restart:
if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno])
{
+ SlruMappingRemove(ctl, shared->page_number[slotno]);
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
continue;
}
@@ -1613,3 +1665,47 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
errno = save_errno;
return result;
}
+
+/*
+ * Look up the given pageno; return its buffer slotno, or -1 if not found.
+ */
+static int
+SlruMappingFind(SlruCtl ctl, int pageno)
+{
+ SlruMappingTableEntry *mapping;
+
+ mapping = hash_search(ctl->mapping_table, &pageno, HASH_FIND, NULL);
+ if (mapping)
+ return mapping->slotno;
+
+ return -1;
+}
+
+/*
+ * Insert a hashtable entry for given pageno and buffer slotno, unless an entry
+ * already exists for that pageno.
+ */
+static void
+SlruMappingAdd(SlruCtl ctl, int pageno, int slotno)
+{
+ SlruMappingTableEntry *mapping;
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ mapping = hash_search(ctl->mapping_table, &pageno, HASH_ENTER, &found);
+ mapping->slotno = slotno;
+
+ Assert(!found);
+}
+
+/*
+ * Delete the hashtable entry for given tag (which must exist).
+ */
+static void
+SlruMappingRemove(SlruCtl ctl, int pageno)
+{
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ hash_search(ctl->mapping_table, &pageno, HASH_REMOVE, &found);
+
+ Assert(found);
+}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index c0d37e3eb3..9cd0899f1d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
#include "storage/sync.h"
+#include "utils/hsearch.h"
/*
* To avoid overflowing internal arithmetic and the size_t data type, the
@@ -116,6 +117,9 @@ typedef struct SlruCtlData
{
SlruShared shared;
+ /* Buffer mapping hash table over slru buffer pool */
+ HTAB *mapping_table;
+
/*
* Which sync handler function to use when handing sync requests over to
* the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 87c1aee379..ec8957f12a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2568,6 +2568,7 @@ SlotNumber
SlruCtl
SlruCtlData
SlruErrorCause
+SlruMappingTableEntry
SlruPageStatus
SlruScanCallback
SlruShared
--
2.39.2 (Apple Git-143)
Attachment: v4-0004-Merge-partition-locks-array-with-buffer-locks-arr.patch
From 30cc4cc9d7f2c65bfa072349ddd26aaa3b3ae0cd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 2 Nov 2023 10:59:03 +0530
Subject: [PATCH v4 4/5] Merge partition locks array with buffer locks array
This will help us get part_cur_lru_count into the same cacheline, since it is frequently accessed in SlruRecentlyUsed.
---
src/backend/access/transam/slru.c | 122 ++++++++++++++++--------------
src/include/access/slru.h | 10 +--
2 files changed, 69 insertions(+), 63 deletions(-)
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index ab7cd276ce..8b89a86a10 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -152,8 +152,7 @@ SimpleLruStructSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(bool)); /* page_dirty[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+ sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */
sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(int)); /* part_cur_lru_count[] */
if (nlsns > 0)
@@ -231,10 +230,8 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
offset += MAXALIGN(nslots * sizeof(int));
/* Initialize LWLocks */
- shared->buffer_locks = (LWLockPadded *) (ptr + offset);
- offset += MAXALIGN(nslots * sizeof(LWLockPadded));
- shared->part_locks = (LWLockPadded *) (ptr + offset);
- offset += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded));
+ shared->locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));
shared->part_cur_lru_count = (int *) (ptr + offset);
offset += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(int));
@@ -247,8 +244,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
ptr += BUFFERALIGN(offset);
for (slotno = 0; slotno < nslots; slotno++)
{
- LWLockInitialize(&shared->buffer_locks[slotno].lock,
- buffer_tranche_id);
+ LWLockInitialize(&shared->locks[slotno].lock, buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -259,7 +255,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize partition locks for each buffer partition. */
for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
{
- LWLockInitialize(&shared->part_locks[partno].lock,
+ LWLockInitialize(&shared->locks[nslots + partno].lock,
part_tranche_id);
shared->part_cur_lru_count[partno] = 0;
}
@@ -369,12 +365,13 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
int partno = slotno / ctl->part_size;
+ int partlockoffset = shared->num_slots + partno;
/* See notes at top of file */
- LWLockRelease(&shared->part_locks[partno].lock);
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
- LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->locks[partlockoffset].lock);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_SHARED);
+ LWLockRelease(&shared->locks[slotno].lock);
+ LWLockAcquire(&shared->locks[partlockoffset].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -387,7 +384,7 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS)
{
- if (LWLockConditionalAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED))
+ if (LWLockConditionalAcquire(&shared->locks[slotno].lock, LW_SHARED))
{
/* indeed, the I/O must have failed */
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
@@ -400,7 +397,7 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
}
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
}
}
}
@@ -433,6 +430,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
{
int slotno;
int partno;
+ int banklockoffset;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -477,11 +475,12 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
shared->page_dirty[slotno] = false;
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_EXCLUSIVE);
partno = slotno / ctl->part_size;
+ banklockoffset = shared->num_slots + partno;
/* Release control lock while doing I/O */
- LWLockRelease(&shared->part_locks[partno].lock);
+ LWLockRelease(&shared->locks[banklockoffset].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -490,7 +489,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[banklockoffset].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -504,7 +503,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
}
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
/* Now it's okay to ereport if we failed */
if (!ok)
@@ -539,12 +538,14 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
SlruShared shared = ctl->shared;
int slotno;
int partno;
+ int partlockoffset;
/* Determine partition number for the page. */
partno = SlruMappingPartNo(ctl, pageno);
+ partlockoffset = shared->num_slots + partno;
/* Try to find the page while holding only shared partition lock */
- LWLockAcquire(&shared->part_locks[partno].lock, LW_SHARED);
+ LWLockAcquire(&shared->locks[partlockoffset].lock, LW_SHARED);
/* See if page is already in a buffer */
slotno = SlruMappingFind(ctl, pageno);
@@ -564,8 +565,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(&shared->part_locks[partno].lock);
- LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->locks[partlockoffset].lock);
+ LWLockAcquire(&shared->locks[partlockoffset].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -588,6 +589,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
int pageno = shared->page_number[slotno];
bool ok;
int partno = slotno / ctl->part_size;
+ int partlockoffset = shared->num_slots + partno;
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -613,10 +615,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
shared->page_dirty[slotno] = false;
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
- LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(&shared->part_locks[partno].lock);
+ LWLockRelease(&shared->locks[partlockoffset].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -631,7 +633,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(&shared->part_locks[partno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[partlockoffset].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -642,7 +644,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
shared->page_status[slotno] = SLRU_PAGE_VALID;
- LWLockRelease(&shared->buffer_locks[slotno].lock);
+ LWLockRelease(&shared->locks[slotno].lock);
/* Now it's okay to ereport if we failed */
if (!ok)
@@ -1219,7 +1221,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
- int lastpartno = 0;
+ int prevlockoffset = shared->num_slots;
bool ok;
/* update the stats counter of flushes */
@@ -1230,17 +1232,17 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(&shared->part_locks[0].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int curpartno = slotno / ctl->part_size;
+ int curlockoffset = shared->num_slots + slotno / ctl->part_size;
- if (curpartno != lastpartno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->part_locks[lastpartno].lock);
- LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
- lastpartno = curpartno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
SlruInternalWritePage(ctl, slotno, &fdata);
@@ -1256,7 +1258,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(&shared->part_locks[lastpartno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
/*
* Now close any files that were open
@@ -1296,7 +1298,8 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
- int prevpartno;
+ int nslots = shared->num_slots;
+ int prevlockoffset;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1322,21 +1325,21 @@ restart:
return;
}
- prevpartno = 0;
- LWLockAcquire(&shared->part_locks[prevpartno].lock, LW_EXCLUSIVE);
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ prevlockoffset = nslots;
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
+ for (slotno = 0; slotno < nslots; slotno++)
{
- int curpartno = slotno / ctl->part_size;
+ int curlockoffset = nslots + (slotno / ctl->part_size);
/*
* If the curpartno is not same as prevpartno then release the lock on
* the prevpartno and acquire the lock on the curpartno.
*/
- if (curpartno != prevpartno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->part_locks[prevpartno].lock);
- LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
- prevpartno = curpartno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
@@ -1370,11 +1373,11 @@ restart:
else
SimpleLruWaitIO(ctl, slotno);
- LWLockRelease(&shared->part_locks[prevpartno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
goto restart;
}
- LWLockRelease(&shared->part_locks[prevpartno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1415,28 +1418,29 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
- int prevpartno = 0;
+ int nslots = shared->num_slots;
+ int prevlockoffset = nslots;
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(&shared->part_locks[prevpartno].lock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->locks[prevlockoffset].lock, LW_EXCLUSIVE);
restart:
did_write = false;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = 0; slotno < nslots; slotno++)
{
int pagesegno;
- int curpartno;
+ int curlockoffset;
- curpartno = slotno / ctl->part_size;
+ curlockoffset = nslots + (slotno / ctl->part_size);
/*
* If the curpartno is not same as prevpartno then release the lock on
* the prevpartno and acquire the lock on the curpartno.
*/
- if (curpartno != prevpartno)
+ if (curlockoffset != prevlockoffset)
{
- LWLockRelease(&shared->part_locks[prevpartno].lock);
- LWLockAcquire(&shared->part_locks[curpartno].lock, LW_EXCLUSIVE);
- prevpartno = curpartno;
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
+ LWLockAcquire(&shared->locks[curlockoffset].lock, LW_EXCLUSIVE);
+ prevlockoffset = curlockoffset;
}
pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
@@ -1474,7 +1478,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(&shared->part_locks[prevpartno].lock);
+ LWLockRelease(&shared->locks[prevlockoffset].lock);
}
/*
@@ -1816,7 +1820,7 @@ SimpleLruGetPartitionLock(SlruCtl ctl, int pageno)
{
int partno = SlruMappingPartNo(ctl, pageno);
- return &(ctl->shared->part_locks[partno].lock);
+ return &(ctl->shared->locks[ctl->shared->num_slots + partno].lock);
}
/*
@@ -1827,9 +1831,10 @@ SimpleLruLockAllPartitions(SlruCtl ctl, LWLockMode mode)
{
SlruShared shared = ctl->shared;
int partno;
+ int nslots = shared->num_slots;
for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
- LWLockAcquire(&shared->part_locks[partno].lock, mode);
+ LWLockAcquire(&shared->locks[nslots + partno].lock, mode);
}
/*
@@ -1840,7 +1845,8 @@ SimpleLruUnLockAllPartitions(SlruCtl ctl)
{
SlruShared shared = ctl->shared;
int partno;
+ int nslots = shared->num_slots;
for (partno = 0; partno < SLRU_NUM_PARTITIONS; partno++)
- LWLockRelease(&shared->part_locks[partno].lock);
+ LWLockRelease(&shared->locks[nslots + partno].lock);
}
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index e6c54d5519..ac1227f29f 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -70,14 +70,14 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
- LWLockPadded *buffer_locks;
/*
- * Locks to protect the in memory buffer slot access in per SLRU bank. The
- * buffer_locks protects the I/O on each buffer slots whereas this lock
- * protect the in memory operation on the buffer within one SLRU bank.
+ * This array contains nslots buffer locks followed by nparts partition
+ * locks. The buffer locks protect the I/O on each buffer slot, whereas
+ * the partition locks protect the in-memory operations on the buffers
+ * within one SLRU partition.
*/
- LWLockPadded *part_locks;
+ LWLockPadded *locks;
/*----------
* Instead of global counter we maintain a partition-wise lru counter
--
2.39.2 (Apple Git-143)
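As a quick sanity check of the merged layout in the patch above (buffer locks occupy indexes [0, nslots), partition locks occupy [nslots, nslots + nparts)), the repeated index arithmetic can be condensed into two helpers. The names here are hypothetical, chosen to mirror the partlockoffset computations in the diff:

```c
#include <assert.h>

/* The buffer lock for a slot lives at index slotno; the partition lock
 * covering that slot lives after all buffer locks, at
 * nslots + slotno / part_size. */
static int
buffer_lock_index(int slotno)
{
    return slotno;
}

static int
part_lock_index(int nslots, int part_size, int slotno)
{
    return nslots + slotno / part_size;
}
```

With 64 slots and 16-slot partitions, slot 40 would use buffer lock 40 and partition lock 66, which matches the offsets computed in SimpleLruWaitIO and friends above.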
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks
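The doubling rule quoted above can be sketched roughly as follows; the function and constant names are illustrative, and the real SlruAdjustNSlots() may differ in details:

```c
#include <assert.h>

#define SLRU_MAX_BANKS     128
#define SLRU_MIN_BANK_SIZE 8

/* Round the configured slot count up to a power of two, then double the
 * bank count while each bank stays at least SLRU_MIN_BANK_SIZE slots wide
 * and we have not reached SLRU_MAX_BANKS. */
static int
slru_adjust_nslots(int nslots, int *nbanks)
{
    int adjusted = 1;

    while (adjusted < nslots)
        adjusted <<= 1;

    *nbanks = 1;
    while (*nbanks < SLRU_MAX_BANKS &&
           adjusted / (*nbanks * 2) >= SLRU_MIN_BANK_SIZE)
        *nbanks *= 2;

    return adjusted;
}
```

For example, 100 configured slots would round up to 128 and split into 16 banks of 8 slots each.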
There was nothing wrong with having too many banks until bank-wise locks and counters were added in later patchsets.
Having a hashtable to find an SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have taken 0001 and 0002 from [1], done some bug fixes in 0001
BTW can you please describe in more detail what kind of bugs?
Thanks for working on this!
Best regards, Andrey Borodin.
[0]: /messages/by-id/CA+hUKGKVqrxOp82zER1=XN=yPwV_-OCGAg=ez=1iz9rG+A7Smw@mail.gmail.com
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks

There was nothing wrong with having too many banks until bank-wise locks and counters were added in later patchsets.
I agree with that, but I feel that with bank-wise locks we are removing
major contention from the centralized control lock, and my first email
shows how much benefit we can get in a simple test case that creates
subtransaction overflow.
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.
The main intention of having this buffer mapping hash is to find the
SLRU page faster than a sequential search when the banks are relatively
big. But if we find cases where the hash creates more overhead than
gain, then I am fine with removing it, because the whole purpose of
adding the hash was to make the lookup faster. So far in my tests I
have not found any slowness. Do you or anyone else have a test case,
based on the previous research, that shows whether it causes any
slowness?
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
Yeah, that could also be an idea if we plan to drop the hash. I mean, a
bank-wise counter is fine since we find a victim buffer within a bank
itself, but each lock could cover more slots than one bank, or in other
words, it could protect multiple banks. Let's hear more opinions on
this.
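The decoupling being discussed can be sketched with plain index arithmetic: victim search stays within a small bank, while one lock covers a group of banks. The names and constants below are illustrative, not from any posted patch:

```c
#include <assert.h>

#define BANK_SIZE       16   /* small, to keep the local linear search fast */
#define BANKS_PER_LOCK  4    /* one lock protects several consecutive banks */

/* Victim search is confined to the slot's bank... */
static int
bank_for_slot(int slotno)
{
    return slotno / BANK_SIZE;
}

/* ...but the protecting lock is chosen at a coarser granularity. */
static int
lock_for_slot(int slotno)
{
    return slotno / (BANK_SIZE * BANKS_PER_LOCK);
}
```

With these numbers, slots 0..63 (banks 0..3) would all share lock 0, while slot 64 starts bank 4 under lock 1.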
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have taken 0001 and 0002 from [1], done some bug fixes in 0001
BTW can you please describe in more detail what kind of bugs?
Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl; see the snippet below from the
original patch.
@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_offsets_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?
PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)
This hash table makes getting transaction status ~50% slower.
Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
Yeah, that could also be an idea if we plan to drop the hash. I mean
bank-wise counter is fine as we are finding a victim buffer within a
bank itself, but each lock could cover more slots than one bank size
or in other words, it can protect multiple banks. Let's hear more
opinion on this.
+1
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have taken 0001 and 0002 from [1], done some bug fixes in 0001
BTW can you please describe in more detail what kind of bugs?
Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl, see the below patch snippet from the
original patch.
Ouch. We were running this for several years with this bug... Thanks!
Best regards, Andrey Borodin.
Attachments:
0001-Implement-benchmark_slru_page_readonly-to-assess-SLR.patch
From 4888ae7664224c5a63e2edb598e658afe0e19f87 Mon Sep 17 00:00:00 2001
From: "Andrey M. Borodin" <x4mmm@172.25.72.30-ekb.dhcp.yndx.net>
Date: Mon, 6 Nov 2023 11:55:38 +0500
Subject: [PATCH] Implement benchmark_slru_page_readonly() to assess SLRU
performance
---
src/test/modules/test_slru/test_slru--1.0.sql | 2 ++
src/test/modules/test_slru/test_slru.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/src/test/modules/test_slru/test_slru--1.0.sql b/src/test/modules/test_slru/test_slru--1.0.sql
index 8635e7df01..3db6ef1029 100644
--- a/src/test/modules/test_slru/test_slru--1.0.sql
+++ b/src/test/modules/test_slru/test_slru--1.0.sql
@@ -11,6 +11,8 @@ CREATE OR REPLACE FUNCTION test_slru_page_read(int, bool DEFAULT true) RETURNS t
AS 'MODULE_PATHNAME', 'test_slru_page_read' LANGUAGE C;
CREATE OR REPLACE FUNCTION test_slru_page_readonly(int) RETURNS text
AS 'MODULE_PATHNAME', 'test_slru_page_readonly' LANGUAGE C;
+CREATE OR REPLACE FUNCTION benchmark_slru_page_readonly(int) RETURNS void
+ AS 'MODULE_PATHNAME', 'benchmark_slru_page_readonly' LANGUAGE C;
CREATE OR REPLACE FUNCTION test_slru_page_exists(int) RETURNS bool
AS 'MODULE_PATHNAME', 'test_slru_page_exists' LANGUAGE C;
CREATE OR REPLACE FUNCTION test_slru_page_delete(int) RETURNS VOID
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..8a1e67a910 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(test_slru_page_write);
PG_FUNCTION_INFO_V1(test_slru_page_writeall);
PG_FUNCTION_INFO_V1(test_slru_page_read);
PG_FUNCTION_INFO_V1(test_slru_page_readonly);
+PG_FUNCTION_INFO_V1(benchmark_slru_page_readonly);
PG_FUNCTION_INFO_V1(test_slru_page_exists);
PG_FUNCTION_INFO_V1(test_slru_page_sync);
PG_FUNCTION_INFO_V1(test_slru_page_delete);
@@ -128,6 +129,23 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
PG_RETURN_TEXT_P(cstring_to_text(data));
}
+Datum
+benchmark_slru_page_readonly(PG_FUNCTION_ARGS)
+{
+ int pageno = PG_GETARG_INT32(0);
+
+ for (int i = 0; i < 1000000000; i++)
+ {
+ SimpleLruReadPage_ReadOnly(TestSlruCtl,
+ pageno,
+ InvalidTransactionId);
+ Assert(LWLockHeldByMe(TestSLRULock));
+ LWLockRelease(TestSLRULock);
+ }
+
+ PG_RETURN_VOID();
+}
+
Datum
test_slru_page_exists(PG_FUNCTION_ARGS)
{
--
2.37.1 (Apple Git-137.1)
On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?

PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.
Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)
With this test, I got below numbers,
nslots   no-hash   hash
8        10s       13s
16       10s       13s
32       15s       13s
64       17s       13s
Yeah, so we can see that with a small bank size (<=16 slots) fetching a
page with the hash is ~30% slower than the sequential search, but
beyond 32 slots the sequential search becomes slower as the number of
slots grows, whereas the hash lookup stays constant, as expected. But,
as you suggested, if we keep the lock partition range different from
the bank size then we might not have any problem with having more
banks, and with that we can keep the bank size small, like 16. Let me
put some more thought into this and get back.
Any other opinions on this?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2023-Nov-06, Dilip Kumar wrote:
Yeah so we can see with a small bank size <=16 slots we are seeing
that the fetching page with hash is 30% slower than the sequential
search, but beyond 32 slots sequential search is become slower as you
grow the number of slots whereas with hash it stays constant as
expected. But now as you told if keep the lock partition range
different than the bank size then we might not have any problem by
having more numbers of banks and with that, we can keep the bank size
small like 16. Let me put some more thought into this and get back.
Any other opinions on this?
dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)
On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.
Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a best hash function itself. And we do not need to cope with collisions much - we can evict a collided buffer.
Given this we do not need a hashtable at all. That's the exact reasoning behind how banks emerged: I started implementing a dynahash patch in April 2021 and found that the "banks" approach is cleaner. However, the term "bank" is not common in software; it's taken from hardware caches.
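The hashtable-free scheme argued for here can be sketched directly: the page number's low bits select the bank, and lookup is a short linear scan within that bank. This is an illustrative, self-contained sketch, not code from any posted patch:

```c
#include <assert.h>

#define NBANKS    8
#define BANK_SIZE 16
#define NSLOTS    (NBANKS * BANK_SIZE)

static int page_number[NSLOTS];
static int page_valid[NSLOTS];

/* pageno itself serves as the hash: it picks the bank. A miss means the
 * caller evicts a victim from that same bank, so there are no collision
 * chains to maintain. */
static int
bank_lookup(int pageno)
{
    int start = (pageno % NBANKS) * BANK_SIZE;

    for (int slot = start; slot < start + BANK_SIZE; slot++)
        if (page_valid[slot] && page_number[slot] == pageno)
            return slot;
    return -1;
}
```

The scan touches at most BANK_SIZE slots regardless of the total pool size, which is the property that makes small banks attractive compared with a single pool-wide linear search.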
Best regards, Andrey Borodin.
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.Dynahash has lock partitioning. Simplehash has not, AFAIK.
Yeah, simplehash doesn't have partitioning, so with simplehash we would
be stuck with the centralized control lock, which is one of the main
problems we are trying to solve here.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:
On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a
best hash function itself. And we do not need to cope with collisions much
- we can evict a collided buffer.

Given this we do not need a hashtable at all. That’s exact reasoning how
banks emerged, I started implementing dynahsh patch in April 2021 and found
out that “banks” approach is cleaner. However the term “bank” is not common
in software, it’s taken from hardware cache.
I agree that we don't need a hash function to generate a hash value out of
pageno, which is sufficient by itself, but I don't understand how we can get
rid of the hash table itself -- how would we map a pageno to a slot number?
Or is that mapping not needed at all?
Regards,
Amul
On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com>
wrote:
[...]
[1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same
patch as the previous patch set
[2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce
buffer mapping hash table
[3] 0003-Partition-wise-slru-locks: Partition the hash table and also
introduce partition-wise locks: this is a merge of 0003 and 0004 from
the previous patch set but instead of bank-wise locks it has
partition-wise locks and LRU counter.
[4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merging
buffer locks and bank locks in the same array so that the bank-wise
LRU counter does not fetch the next cache line in a hot function
SlruRecentlyUsed()(same as 0005 from the previous patch set)
[5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure
that the number of slots is in multiple of the number of banks
[...]
Here are some minor comments:
+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
Do you mean "4MB of for every 1GB" in the comment?
--
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
#include "replication/origin.h"
#include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
A spurious change.
--
@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /*
group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
Another spurious change in 0002 patch.
--
+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the
pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */
I didn't see the SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
your patches; is that an outdated comment?
--
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /*
part_locks[] */
+ sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));
/* locks[] */
I am a bit uncomfortable with these changes; merging partition and buffer
locks makes the code hard to understand. I'm not sure what we are getting
out of this?
--
Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of
partitions
I think the 0005 patch can be merged into 0001.
Regards,
Amul
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:
Thanks for the review, Amul.
Here are some minor comments:
+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
 */
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
Do you mean "4MB of for every 1GB" in the comment?
You are right
--
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
 #include "replication/origin.h"
 #include "storage/sync.h"
-
extern PGDLLIMPORT bool track_commit_timestamp;
A spurious change.
Will fix
--
@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
}
Another spurious change in the 0002 patch.
Will fix
--
+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */
I didn't see the SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in
your patches; is that an outdated comment?
Yes, I will fix it; actually, there are some major design changes coming in this area.
--
- sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
- sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */
+ sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */
I am a bit uncomfortable with these changes; merging partition and buffer locks
makes the code hard to understand. I'm not sure what we are getting out of
this?
Yes, even I don't like this much because it is confusing. But the
advantage of this is that we use a single pointer for the locks, which
means the next variable, the LRU counter, comes in the same cache line,
so frequent updates of the LRU counter benefit from this. Although I
don't have any numbers which prove this.
Currently, I want to focus on all the base patches and keep this patch
as an add-on; later, if we find it useful and want to pursue it, we
will see how to make it more readable.
Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of
partitions
I think the 0005 patch can be merged into 0001.
Yeah, in the next version it is done that way. Planning to post by the end of the day.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:
changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks
There was nothing wrong with having too many banks. Until bank-wise locks
and counters were added in later patchsets.
Having a hashtable to find an SLRU page in the buffer pool is, IMV, too slow.
Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not
degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not
find the commit when it was changed to linear search.
Maybe we could decouple locks and counters from SLRU banks? Banks were
meant to be small to exploit the performance of local linear search. Lock
partitions have to be bigger for sure.
Is there a particular reason why lock partitions need to be bigger? We have
one lock per buffer anyway; bank-wise locks will increase the number of
locks by less than 10%.
I am working on trying out a SIMD based LRU mechanism that uses a 16 entry
bank. The data layout is:
struct CacheBank {
    int  page_numbers[16];
    char access_age[16];
};
The first part uses up one cache line, and the second line has 48 bytes of
space left over that could fit a lwlock and page_status, page_dirty arrays.
Lookup + LRU maintenance has 20 instructions/14 cycle latency and the only
branch is for found/not found. Hoping to have a working prototype of SLRU
on top in the next couple of days.
Regards,
Ants Aasma
On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway, bankwise locks will increase the number of locks < 10%.
The problem was not attracting much attention for some years, so my reasoning was that the solution should not have any cost at all. The initial patchset with banks did not add any memory footprint.
On 8 Nov 2023, at 14:17, Ants Aasma <ants@cybertec.at> wrote:
I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank.
FWIW I tried to pack struct parts together to minimize cache lines touched, see step 3 in [0]. So far I could not prove any performance benefits of this approach. But maybe your implementation will be more efficient.
Thanks!
Best regards, Andrey Borodin.
[0]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru
On Mon, Nov 6, 2023 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.
Yeah, that could also be an idea if we plan to drop the hash. I mean,
the bank-wise counter is fine since we find a victim buffer within a
bank itself, but each lock could cover more slots than one bank size;
in other words, it can protect multiple banks. Let's hear more
opinions on this.
Here is the updated version of the patch. Here I have taken the
approach suggested by Andrey; I discussed it with Alvaro offlist and
he also agrees with it. The idea is that we keep the bank size fixed
at 16 buffers per bank, and the configured GUC value for each SLRU
buffer pool must be a multiple of the bank size. We have removed the
centralized lock, but instead of one lock per bank we cap the number
of bank locks at 128. We chose 128 because in one operation (i.e.
ActivateCommitTs) we need to acquire all the bank locks (though this
is not a performance path at all), and a backend can hold at most 200
LWLocks at a time, so we think a limit of 128 is safe. So now if the
number of banks is <= 128 we use one lock per bank; otherwise one lock
may protect buffers in multiple banks. One might argue that we should
keep the maximum lower, say 64 or 32, and I am open to that; we can
run more experiments with a very large buffer pool and a very heavy
workload to see whether having up to 128 locks is helpful or not.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v5-0002-Divide-SLRU-buffers-into-banks.patchapplication/octet-stream; name=v5-0002-Divide-SLRU-buffers-into-banks.patchDownload
From ca083bb571a927202200f94909bcb280a39055ed Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 16:51:34 +0530
Subject: [PATCH v5 2/3] Divide SLRU buffers into banks
As we have made the SLRU buffer pool configurable, we want to
eliminate the linear search within the whole SLRU buffer pool. To
do so, we divide SLRU buffers into banks. Each bank holds 16
buffers. Each SLRU pageno may reside in only one bank.
Adjacent pagenos reside in different banks. Along with this,
also ensure that the number of slru buffers is given in
multiples of the bank size.
Andrey M. Borodin and Dilip Kumar, based on feedback by Alvaro Herrera
---
src/backend/access/transam/clog.c | 10 ++++++++
src/backend/access/transam/commit_ts.c | 10 ++++++++
src/backend/access/transam/multixact.c | 19 ++++++++++++++
src/backend/access/transam/slru.c | 34 +++++++++++++++++++++++---
src/backend/access/transam/subtrans.c | 10 ++++++++
src/backend/commands/async.c | 10 ++++++++
src/backend/storage/lmgr/predicate.c | 10 ++++++++
src/backend/utils/misc/guc_tables.c | 14 +++++------
src/include/access/slru.h | 12 ++++++++-
src/include/utils/guc_hooks.h | 11 +++++++++
10 files changed, 128 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 7979bbd00f..ab3893cf4f 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -43,6 +43,7 @@
#include "pgstat.h"
#include "storage/proc.h"
#include "storage/sync.h"
+#include "utils/guc_hooks.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -1019,3 +1020,12 @@ clogsyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(XactCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for xact_buffers
+ */
+bool
+check_xact_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("xact_buffers", newval);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9ba5ae6534..96810959ab 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -33,6 +33,7 @@
#include "pg_trace.h"
#include "storage/shmem.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
@@ -1017,3 +1018,12 @@ committssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(CommitTsCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for commit_ts_buffers
+ */
+bool
+check_commit_ts_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("commit_ts_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 62709fcd07..77511c6342 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -88,6 +88,7 @@
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
@@ -3419,3 +3420,21 @@ multixactmemberssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for multixact_offsets_buffers
+ */
+bool
+check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_offsets_buffers", newval);
+}
+
+/*
+ * GUC check_hook for multixact_members_buffers
+ */
+bool
+check_multixact_members_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_members_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9ed24e1185..8697a27555 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "utils/guc_hooks.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -134,7 +135,6 @@ typedef enum
static SlruErrorCause slru_errcause;
static int slru_errno;
-
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -258,7 +258,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
else
+ {
Assert(found);
+ Assert(shared->num_slots == nslots);
+ }
/*
* Initialize the unshared control struct, including directory path. We
@@ -266,6 +269,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
*/
ctl->shared = shared;
ctl->sync_handler = sync_handler;
+ ctl->bank_mask = (nslots / SLRU_BANK_SIZE) - 1;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -497,12 +501,14 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
LWLockAcquire(shared->ControlLock, LW_SHARED);
/* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
@@ -1031,7 +1037,10 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
+
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
@@ -1066,7 +1075,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
int this_page_number;
@@ -1613,3 +1622,20 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
errno = save_errno;
return result;
}
+
+/*
+ * Helper function for GUC check_hook to check whether slru buffers are in
+ * multiples of SLRU_BANK_SIZE.
+ */
+bool
+check_slru_buffers(const char *name, int *newval)
+{
+ /* The value must be a multiple of the SLRU bank size */
+ if (*newval % SLRU_BANK_SIZE == 0)
+ return true;
+
+ /* Not a multiple of the bank size */
+ GUC_check_errdetail("\"%s\" must be a multiple of %d", name,
+ SLRU_BANK_SIZE);
+ return false;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0dd48f40f3..923e706535 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -33,6 +33,7 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
@@ -373,3 +374,12 @@ SubTransPagePrecedes(int page1, int page2)
return (TransactionIdPrecedes(xid1, xid2) &&
TransactionIdPrecedes(xid1, xid2 + SUBTRANS_XACTS_PER_PAGE - 1));
}
+
+/*
+ * GUC check_hook for subtrans_buffers
+ */
+bool
+check_subtrans_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("subtrans_buffers", newval);
+}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bdbbe5cc0..98449cbdde 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -149,6 +149,7 @@
#include "storage/sinval.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -2444,3 +2445,12 @@ ClearPendingActionsAndNotifies(void)
pendingActions = NULL;
pendingNotifies = NULL;
}
+
+/*
+ * GUC check_hook for notify_buffers
+ */
+bool
+check_notify_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("notify_buffers", newval);
+}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 18ea18316d..e4903c67ec 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -208,6 +208,7 @@
#include "storage/predicate_internals.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "utils/guc_hooks.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -5011,3 +5012,12 @@ AttachSerializableXact(SerializableXactHandle handle)
if (MySerializableXact != InvalidSerializableXact)
CreateLocalPredicateLockHash();
}
+
+/*
+ * GUC check_hook for serial_buffers
+ */
+bool
+check_serial_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("serial_buffers", newval);
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c82635943b..7c85d2126e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2296,7 +2296,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_offsets_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_offsets_buffers, NULL, NULL
},
{
@@ -2307,7 +2307,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_members_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_members_buffers, NULL, NULL
},
{
@@ -2318,7 +2318,7 @@ struct config_int ConfigureNamesInt[] =
},
&subtrans_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_subtrans_buffers, NULL, NULL
},
{
{"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
@@ -2328,7 +2328,7 @@ struct config_int ConfigureNamesInt[] =
},
¬ify_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_notify_buffers, NULL, NULL
},
{
@@ -2339,7 +2339,7 @@ struct config_int ConfigureNamesInt[] =
},
&serial_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_serial_buffers, NULL, NULL
},
{
@@ -2350,7 +2350,7 @@ struct config_int ConfigureNamesInt[] =
},
&xact_buffers,
64, 0, CLOG_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_xact_buffers
+ check_xact_buffers, NULL, show_xact_buffers
},
{
@@ -2361,7 +2361,7 @@ struct config_int ConfigureNamesInt[] =
},
&commit_ts_buffers,
64, 0, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_commit_ts_buffers
+ check_commit_ts_buffers, NULL, show_commit_ts_buffers
},
{
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index c0d37e3eb3..51c5762b9f 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 16
+
/*
* To avoid overflowing internal arithmetic and the size_t data type, the
* number of buffers should not exceed this number.
@@ -139,6 +144,11 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /*
+ * Mask for slotno banks
+ */
+ Size bank_mask;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
@@ -175,5 +185,5 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
#endif /* SLRU_H */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 8597e430de..7dd96a2059 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -128,6 +128,17 @@ extern bool check_ssl(bool *newval, void **extra, GucSource source);
extern bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
extern bool check_synchronous_standby_names(char **newval, void **extra,
GucSource source);
+extern bool check_multixact_offsets_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_multixact_members_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_subtrans_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
+extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
+extern bool check_commit_ts_buffers(int *newval, void **extra,
+ GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
extern void assign_syslog_facility(int newval, void *extra);
--
2.39.2 (Apple Git-143)
v5-0003-Remove-the-centralized-control-lock-and-LRU-count.patchapplication/octet-stream; name=v5-0003-Remove-the-centralized-control-lock-and-LRU-count.patchDownload
From 263f0bb133d8214bced70ba9f0df0b2981974bdf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Tue, 7 Nov 2023 09:51:37 +0530
Subject: [PATCH v5 3/3] Remove the centralized control lock and LRU counter
The previous patch divided the SLRU buffer pool into associative
banks. This patch optimizes it further by introducing
multiple SLRU locks instead of a common centralized lock, which
reduces contention on the SLRU control lock. Basically,
we will have at most 128 bank locks; if the number of banks
is <= 128 then each lock covers exactly one bank, otherwise
a lock covers multiple banks, and we find the bank-to-lock
mapping by (bankno % 128). This patch also removes the
centralized LRU counter; we now have bank-wise LRU
counters, which avoids frequent cache invalidation when
modifying the counter.
Dilip Kumar based on design inputs from Andrey M. Borodin,
Robert Haas, and Alvaro Herrera
---
src/backend/access/transam/clog.c | 114 +++++++----
src/backend/access/transam/commit_ts.c | 43 ++--
src/backend/access/transam/multixact.c | 177 ++++++++++++-----
src/backend/access/transam/slru.c | 238 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 58 ++++--
src/backend/commands/async.c | 43 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 33 ++--
src/include/access/slru.h | 63 ++++--
src/include/storage/lwlock.h | 7 +
src/test/modules/test_slru/test_slru.c | 32 +--
12 files changed, 589 insertions(+), 247 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index ab3893cf4f..7b546cab3c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -275,14 +275,19 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the SLRU bank lock w.r.t. the page we are going to access. */
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on SLRU lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
+ * updates for multiple backends so that the number of times the SLRU lock
* needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
@@ -301,17 +306,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire SLRU lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -324,10 +329,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -346,7 +351,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruGetSLRUBankLock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -397,14 +403,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire SLRU bank lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * the lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention when many
+ * processes are trying to commit at once, since the lock need not be
+ * repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -418,6 +423,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -498,13 +505,10 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
- * group XID status update, saving a pointer to the head of the list.
- * Trying to pop elements one at a time could lead to an ABA problem.
+ * We are leader so clear the list of processes waiting for group XID
+ * status update, saving a pointer to the head of the list. Trying to pop
+ * elements one at a time could lead to an ABA problem.
*/
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
INVALID_PGPROCNO);
@@ -512,10 +516,38 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /* Acquire the SLRU bank lock w.r.t. the first page in the group. */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruGetSLRUBankLock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * Although we try our best to keep the same page within a group, there
+ * are cases where we might get different pages as well; for details,
+ * refer to the comment in the while loop above where we add this process
+ * for group update. So if the current page we are going to access is
+ * not in the same slru bank in which we updated the last page, then we
+ * need to release the lock on the previous bank and acquire the lock on
+ * the bank w.r.t. the page we are going to update now.
+ */
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -535,7 +567,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -564,10 +597,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with the lock of the slot's SLRU bank held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -656,7 +690,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(XactCtl, pageno));
return status;
}
@@ -690,8 +724,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
- SYNC_HANDLER_CLOG);
+ "pg_xact", LWTRANCHE_XACT_BUFFER,
+ LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -705,8 +739,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -715,7 +750,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -750,14 +785,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -768,8 +799,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -801,7 +833,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -833,6 +865,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -843,13 +876,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -987,16 +1021,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 96810959ab..ae1badd295 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -219,8 +219,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -230,13 +231,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the lock of the slot's SLRU bank held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -337,7 +338,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -527,9 +528,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -685,9 +685,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -714,12 +712,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -768,9 +767,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruReleaseAllBankLock(CommitTsCtl);
}
/*
@@ -802,6 +801,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -822,12 +822,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -981,16 +983,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -1002,7 +1006,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 77511c6342..ad31b2017b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -193,10 +193,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the SLRU bank locks of MultiXactOffset
+ * and MultiXactMember to guard accesses to the two sets of SLRU buffers.
+ * For concurrency's sake, we avoid holding more than one of these locks at
+ * a time.)
*/
typedef struct MultiXactStateData
{
@@ -871,12 +871,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -892,10 +895,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -917,6 +918,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether this
+ * new page falls into a different SLRU bank; if so, release the old
+ * bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -937,7 +952,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1240,6 +1256,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1343,11 +1361,23 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If the page falls into a different SLRU bank, release the lock on the
+ * previous bank (if we are already holding one) and acquire the lock on
+ * the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1380,7 +1410,22 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * The SLRU pageno has changed, so check whether this page falls
+ * into a different SLRU bank than the one whose lock we are
+ * already holding; if so, release the lock on the old bank and
+ * acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1389,7 +1434,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1398,13 +1444,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1420,6 +1464,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether this
+ * new page falls into a different SLRU bank; if so, release the old
+ * bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1443,7 +1501,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1853,14 +1912,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", multixact_offsets_buffers, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", multixact_members_buffers, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
@@ -1895,8 +1954,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1905,9 +1966,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1916,7 +1978,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1976,10 +2038,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1994,7 +2058,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2016,13 +2080,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2047,13 +2113,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2068,7 +2134,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2076,18 +2144,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2099,7 +2166,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2114,10 +2183,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2405,6 +2473,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2415,13 +2484,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2454,15 +2524,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2760,7 +2832,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3242,31 +3314,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3332,7 +3406,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 8697a27555..dd1a4f13b2 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -72,6 +72,21 @@
*/
#define MAX_WRITEALL_BUFFERS 16
+/*
+ * Macro to get the index, in the bank_locks array in SlruSharedData, of the
+ * lock protecting a given slotno.
+ *
+ * The SLRU buffer pool is divided into banks of buffers, and there are at
+ * most SLRU_MAX_BANKLOCKS locks protecting access to the buffers in those
+ * banks. Because the number of locks is capped, we cannot always have one
+ * lock per bank: as long as the number of banks is <= SLRU_MAX_BANKLOCKS,
+ * each bank has its own lock; otherwise a single lock may protect multiple
+ * banks.
+ */
+#define SLRU_SLOTNO_GET_BANKLOCKNO(slotno) \
+ (((slotno) / SLRU_BANK_SIZE) % SLRU_MAX_BANKLOCKS)
+
typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
@@ -93,34 +108,6 @@ typedef struct SlruWriteAllData *SlruWriteAll;
(a).segno = (xx_segno) \
)
-/*
- * Macro to mark a buffer slot "most recently used". Note multiple evaluation
- * of arguments!
- *
- * The reason for the if-test is that there are often many consecutive
- * accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
- * pages' counts will "wrap around" and make them appear recently used.
- *
- * We allow this code to be executed concurrently by multiple processes within
- * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
- * this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
- * page_lru_count entries to be "reset" to lower values than they should have,
- * in case a process is delayed while it executes this macro. With care in
- * SlruSelectLRUPage(), this does little harm, and in any case the absolute
- * worst possible consequence is a nonoptimal choice of page to evict. The
- * gain from allowing concurrent reads of SLRU pages seems worth it.
- */
-#define SlruRecentlyUsed(shared, slotno) \
- do { \
- int new_lru_count = (shared)->cur_lru_count; \
- if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
- (shared)->page_lru_count[slotno] = new_lru_count; \
- } \
- } while (0)
-
/* Saved info for SlruReportIOError */
typedef enum
{
@@ -147,6 +134,7 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static inline void SlruRecentlyUsed(SlruShared shared, int slotno);
/*
* Initialization of shared memory
@@ -156,6 +144,8 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
+ int nbanks = nslots / SLRU_BANK_SIZE;
+ int nbanklocks = Min(nbanks, SLRU_MAX_BANKLOCKS);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -165,6 +155,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN(nbanklocks * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN(nbanks * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -181,16 +173,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
* nlsns: number of LSN groups per page (set to zero if not relevant).
* ctllock: LWLock to use to control access to the shared control structure.
* subdir: PGDATA-relative subdirectory that will contain the files.
- * tranche_id: LWLock tranche ID to use for the SLRU's per-buffer LWLocks.
+ * buffer_tranche_id: tranche ID to use for the SLRU's per-buffer LWLocks.
+ * bank_tranche_id: tranche ID to use for the bank LWLocks.
* sync_handler: which set of functions to use to handle sync requests
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int buffer_tranche_id, int bank_tranche_id,
SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
+ int nbanks = nslots / SLRU_BANK_SIZE;
+ int nbanklocks = Min(nbanks, SLRU_MAX_BANKLOCKS);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -202,18 +197,16 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int bankno;
+ int banklockno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -234,6 +227,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->bank_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(nbanklocks * sizeof(LWLockPadded));
+ shared->bank_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(int));
if (nlsns > 0)
{
@@ -245,7 +242,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
for (slotno = 0; slotno < nslots; slotno++)
{
LWLockInitialize(&shared->buffer_locks[slotno].lock,
- tranche_id);
+ buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -254,6 +251,15 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
ptr += BLCKSZ;
}
+ /* Initialize the bank locks. */
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockInitialize(&shared->bank_locks[banklockno].lock,
+ bank_tranche_id);
+
+ /* Initialize the bank lru counters. */
+ for (bankno = 0; bankno < nbanks; bankno++)
+ shared->bank_cur_lru_count[bankno] = 0;
+
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
@@ -307,7 +313,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -346,12 +352,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -406,6 +413,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int banklockno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -448,9 +456,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -459,7 +468,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -503,9 +512,10 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
int slotno;
int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(bankstart);
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_SHARED);
/* See if page is already in a buffer */
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -525,8 +535,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -548,6 +558,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -576,7 +587,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -591,7 +602,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1037,7 +1048,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int best_invalid_page_number = 0; /* keep compiler quiet */
/* See if page already has a buffer assigned */
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
for (slotno = bankstart; slotno < bankend; slotno++)
@@ -1074,7 +1086,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
+ cur_count = (shared->bank_cur_lru_count[bankno])++;
for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
@@ -1096,7 +1108,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1170,6 +1182,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
bool ok;
/* update the stats counter of flushes */
@@ -1180,10 +1193,23 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If curlockno is not the same as prevlockno then release the previous
+ * lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1197,7 +1223,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
/*
* Now close any files that were open
@@ -1237,6 +1263,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevlockno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1247,25 +1274,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If curlockno is not the same as prevlockno then release the previous
+ * lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1295,10 +1335,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1339,15 +1381,29 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If curlockno is not the same as prevlockno then release the previous
+ * lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1381,7 +1437,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
}
/*
@@ -1623,6 +1679,38 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
return result;
}
+/*
+ * Function to mark a buffer slot "most recently used".
+ *
+ * The reason for the if-test is that there are often many consecutive
+ * accesses to the same page (particularly the latest page). By suppressing
+ * useless increments of bank_cur_lru_count, we reduce the probability that old
+ * pages' counts will "wrap around" and make them appear recently used.
+ *
+ * We allow this code to be executed concurrently by multiple processes within
+ * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
+ * this should not cause any completely-bogus values to enter the computation.
+ * However, it is possible for either bank_cur_lru_count or individual
+ * page_lru_count entries to be "reset" to lower values than they should have,
+ * in case a process is delayed while it executes this function. With care in
+ * SlruSelectLRUPage(), this does little harm, and in any case the absolute
+ * worst possible consequence is a nonoptimal choice of page to evict. The
+ * gain from allowing concurrent reads of SLRU pages seems worth it.
+ */
+static inline void
+SlruRecentlyUsed(SlruShared shared, int slotno)
+{
+ int bankno = slotno / SLRU_BANK_SIZE;
+ int new_lru_count = shared->bank_cur_lru_count[bankno];
+
+ if (new_lru_count != shared->page_lru_count[slotno])
+ {
+ shared->bank_cur_lru_count[bankno] = ++new_lru_count;
+ shared->page_lru_count[slotno] = new_lru_count;
+ }
+}
+
/*
* Helper function for GUC check_hook to check whether slru buffers are in
* multiples of SLRU_BANK_SIZE.
@@ -1639,3 +1727,37 @@ check_slru_buffers(const char *name, int *newval)
SLRU_BANK_SIZE);
return false;
}
+
+/*
+ * Function to acquire all the bank locks of the given SlruCtl
+ */
+void
+SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int banklockno;
+ int nbanklocks;
+
+ /* Compute number of bank locks. */
+ nbanklocks = Min(shared->num_slots / SLRU_BANK_SIZE, SLRU_MAX_BANKLOCKS);
+
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, mode);
+}
+
+/*
+ * Function to release all the bank locks of the given SlruCtl
+ */
+void
+SimpleLruReleaseAllBankLock(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int banklockno;
+ int nbanklocks;
+
+ /* Compute number of bank locks. */
+ nbanklocks = Min(shared->num_slots / SLRU_BANK_SIZE, SLRU_MAX_BANKLOCKS);
+
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 923e706535..ff47985f08 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -78,12 +78,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -101,7 +103,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -131,7 +133,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SubTransCtl, pageno));
return parent;
}
@@ -194,8 +196,9 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU,
+ SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -213,8 +216,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -223,7 +227,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -253,6 +257,8 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
FullTransactionId nextXid;
int startPage;
int endPage;
+ LWLock *prevlock;
+ LWLock *lock;
/*
* Since we don't expect pg_subtrans to be valid across crashes, we
@@ -260,23 +266,47 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
-
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+ prevlock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
while (startPage != endPage)
{
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+ * If the target page has moved to a new bank, release the lock on the
+ * old bank and acquire the lock on the new one.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
(void) ZeroSUBTRANSPage(startPage);
startPage++;
/* must account for wraparound */
if (startPage > TransactionIdToPage(MaxTransactionId))
startPage = 0;
}
- (void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+ * If the target page has moved to a new bank, release the lock on the
+ * old bank and acquire the lock on the new one.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ (void) ZeroSUBTRANSPage(startPage);
+ LWLockRelease(lock);
}
/*
@@ -310,6 +340,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -321,12 +352,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 98449cbdde..67da0b48bd 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -268,9 +268,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * The SLRU buffer pool is divided into banks, and a bank-wise SLRU lock is
+ * used as the control lock for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -571,7 +572,7 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER, LWTRANCHE_NOTIFY_SLRU,
SYNC_HANDLER_NONE);
if (!found)
@@ -1403,7 +1404,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * the page-specific SLRU bank lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1413,9 +1414,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *prevlock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1439,6 +1438,11 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ prevlock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+
+ /* We hold both NotifyQueueLock and SLRU bank lock during this operation */
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1484,6 +1488,17 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Advance queue_head appropriately, and detect if page is full */
if (asyncQueueAdvance(&(queue_head), qe.length))
{
+ LWLock *lock;
+
+ pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+ if (lock != prevlock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
/*
* Page is full, so we're done here, but first fill the next page
* with zeroes. The reason to do this is to ensure that slru.c's
@@ -1510,7 +1525,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(prevlock);
return nextNotify;
}
@@ -1989,9 +2004,9 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
- * and possibly transmitting them to our frontend. Copy only the
- * part of the page we will actually inspect.
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
InvalidTransactionId);
@@ -2011,7 +2026,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2052,7 +2067,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU bank lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2205,7 +2220,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU bank locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+#26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+#38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e4903c67ec..7632c42978 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -809,8 +809,9 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- serial_buffers, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ serial_buffers, 0, "pg_serial",
+ LWTRANCHE_SERIAL_BUFFER, LWTRANCHE_SERIAL_SLRU,
+ SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -847,12 +848,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruGetSLRUBankLock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -902,7 +905,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -920,10 +923,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -935,13 +938,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
+ * The following function must be called without holding SLRU bank lock,
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -954,7 +957,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -966,7 +969,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -984,7 +987,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -993,7 +996,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1007,12 +1010,12 @@ CheckPointPredicate(void)
{
int truncateCutoffPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1072,7 +1075,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, truncateCutoffPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 51c5762b9f..d9be57de75 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -21,6 +21,7 @@
* SLRU bank size for slotno hash banks
*/
#define SLRU_BANK_SIZE 16
+#define SLRU_MAX_BANKLOCKS 128
/*
* To avoid overflowing internal arithmetic and the size_t data type, the
@@ -62,8 +63,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -76,36 +75,52 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
+
+ /* The buffer_locks protect the I/O on each buffer slot */
LWLockPadded *buffer_locks;
/*
- * Optional array of WAL flush LSNs associated with entries in the SLRU
- * pages. If not zero/NULL, we must flush WAL before writing pages (true
- * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
- * has lsn_groups_per_page entries per buffer slot, each containing the
- * highest LSN known for a contiguous group of SLRU entries on that slot's
- * page.
+ * Locks to protect the in-memory buffer slot access within each SLRU
+ * bank. If the number of banks is <= SLRU_MAX_BANKLOCKS then there is
+ * one lock per bank; otherwise each lock protects multiple banks,
+ * depending on the number of banks.
*/
- XLogRecPtr *group_lsn;
- int lsn_groups_per_page;
+ LWLockPadded *bank_locks;
/*----------
+ * Instead of a global counter we maintain a bank-wise LRU counter because
+ * a) victim buffer selection is done at the bank level, so there is no
+ * point in having a global counter, and b) manipulating a global counter
+ * causes frequent CPU cache invalidation, which hurts performance.
+ *
* We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
+ * page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
* The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
+ * bank_cur_lru_count[bankno] - page_lru_count[slotno]
* The counts will eventually wrap around, but this calculation still
* works as long as no page's age exceeds INT_MAX counts.
*----------
*/
- int cur_lru_count;
+ int *bank_cur_lru_count;
+
+ /*
+ * Optional array of WAL flush LSNs associated with entries in the SLRU
+ * pages. If not zero/NULL, we must flush WAL before writing pages (true
+ * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
+ * has lsn_groups_per_page entries per buffer slot, each containing the
+ * highest LSN known for a contiguous group of SLRU entries on that slot's
+ * page.
+ */
+ XLogRecPtr *group_lsn;
+ int lsn_groups_per_page;
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -153,11 +168,24 @@ typedef struct SlruCtlData
typedef SlruCtlData *SlruCtl;
+/*
+ * Get the SLRU bank lock for the given SlruCtl and pageno.
+ *
+ * This lock must be acquired in order to access the SLRU buffer slots in
+ * the respective bank. For details, see the comments in SlruSharedData.
+ */
+static inline LWLock *
+SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno)
+{
+ int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;
+
+ return &(ctl->shared->bank_locks[banklockno].lock);
+}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int buffer_tranche_id,
+ int bank_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -185,5 +213,8 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
+extern LWLock *SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno);
extern bool check_slru_buffers(const char *name, int *newval);
+extern void SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruReleaseAllBankLock(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b038e599c0..87cb812b84 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,13 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..9a02f33933 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -40,10 +40,6 @@ PG_FUNCTION_INFO_V1(test_slru_delete_all);
/* Number of SLRU page slots */
#define NUM_TEST_BUFFERS 16
-/* SLRU control lock */
-LWLock TestSLRULock;
-#define TestSLRULock (&TestSLRULock)
-
static SlruCtlData TestSlruCtlData;
#define TestSlruCtl (&TestSlruCtlData)
@@ -63,9 +59,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +76,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +95,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +113,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +131,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +214,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +228,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
+ LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
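To make the hunks above easier to follow, the page-to-bank and bank-to-lock mapping can be condensed into a few lines of plain C. This is an illustrative sketch, not code from the patch: the constants mirror `SLRU_BANK_SIZE` (16) and `SLRU_MAX_BANKLOCKS` (128) from `slru.h`, `bank_mask` is assumed to be `nbanks - 1` with `nbanks` a power of two, and the helper names are hypothetical.

```c
#include <assert.h>

#define SLRU_BANK_SIZE     16   /* slots per associative bank */
#define SLRU_MAX_BANKLOCKS 128  /* upper bound on bank control locks */

/* Which bank may cache this page; bank_mask is (nbanks - 1). */
int bank_for_page(int pageno, int bank_mask)
{
    return pageno & bank_mask;
}

/* First slot of that bank; the linear search covers only
 * SLRU_BANK_SIZE slots starting here, regardless of pool size. */
int bank_start_slot(int pageno, int bank_mask)
{
    return bank_for_page(pageno, bank_mask) * SLRU_BANK_SIZE;
}

/* Which control lock covers that bank; locks are reused modulo
 * SLRU_MAX_BANKLOCKS when there are more than 128 banks. */
int bank_lock_for_page(int pageno, int bank_mask)
{
    return bank_for_page(pageno, bank_mask) % SLRU_MAX_BANKLOCKS;
}
```

Because a page can only ever live in its own bank, both buffer lookup and victim selection stay O(SLRU_BANK_SIZE) no matter how large the configured buffer pool is.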
Attachment: v5-0001-Make-all-SLRU-buffer-sizes-configurable.patch (application/octet-stream)
From 8d2e41ed3d7b105cb224608b75e2cc4a2568b266 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 14:45:00 +0530
Subject: [PATCH v5 1/3] Make all SLRU buffer sizes configurable.
Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.
Default sizes are also set to 64 as sizes much larger than the old
limits have been shown to be useful on modern systems.
Patch by Andrey M. Borodin, Dilip Kumar
Reviewed By Anastasia Lubennikova, Tomas Vondra, Alexander Korotkov,
Gilles Darold, Thomas Munro
---
doc/src/sgml/config.sgml | 135 ++++++++++++++++++
src/backend/access/transam/clog.c | 23 ++-
src/backend/access/transam/commit_ts.c | 7 +-
src/backend/access/transam/multixact.c | 8 +-
src/backend/access/transam/subtrans.c | 5 +-
src/backend/commands/async.c | 8 +-
src/backend/commands/variable.c | 25 ++++
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/utils/init/globals.c | 8 ++
src/backend/utils/misc/guc_tables.c | 77 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 9 ++
src/include/access/clog.h | 10 ++
src/include/access/multixact.h | 4 -
src/include/access/slru.h | 5 +
src/include/access/subtrans.h | 3 -
src/include/commands/async.h | 5 -
src/include/miscadmin.h | 7 +
src/include/storage/predicate.h | 4 -
src/include/utils/guc_hooks.h | 2 +
19 files changed, 305 insertions(+), 44 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd70ff2e4b..654db076b1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,141 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-offsets-buffers" xreflabel="multixact_offsets_buffers">
+ <term><varname>multixact_offsets_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/offsets</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-buffers" xreflabel="multixact_members_buffers">
+ <term><varname>multixact_members_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/members</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-subtrans-buffers" xreflabel="subtrans_buffers">
+ <term><varname>subtrans_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>subtrans_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_subtrans</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-notify-buffers" xreflabel="notify_buffers">
+ <term><varname>notify_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_notify</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-serial-buffers" xreflabel="serial_buffers">
+ <term><varname>serial_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>serial_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_serial</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-xact-buffers" xreflabel="xact_buffers">
+ <term><varname>xact_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>xact_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_xact</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 512, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-commit-ts-buffers" xreflabel="commit_ts_buffers">
+ <term><varname>commit_ts_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>commit_ts_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents of
+ <literal>pg_commit_ts</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 1024, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..7979bbd00f 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -58,8 +58,8 @@
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
-#define CLOG_XACTS_PER_BYTE 4
-#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+StaticAssertDecl((CLOG_BITS_PER_XACT * CLOG_XACTS_PER_BYTE) == BITS_PER_BYTE,
+ "CLOG_BITS_PER_XACT and CLOG_XACTS_PER_BYTE are inconsistent");
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
@@ -663,23 +663,16 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
+ * By default, we'll use 2MB for every 1GB of shared buffers, up to the
+ * theoretical maximum useful value, but always at least 16 buffers.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(128, Max(4, NBuffers / 512));
+ /* Use configured value if provided. */
+ if (xact_buffers > 0)
+ return Max(16, xact_buffers);
+ return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(16, NBuffers / 512));
}
/*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..9ba5ae6534 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -493,11 +493,16 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
* We use a very similar logic as for the number of CLOG buffers (except we
* scale up twice as fast with shared buffers, and the maximum is twice as
* high); see comments in CLOGShmemBuffers.
+ * By default, we'll use 4MB for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
}
/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57ed34c0a8..62709fcd07 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1834,8 +1834,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_buffers, 0));
return size;
}
@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_members_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..0dd48f40f3 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
@@ -184,14 +185,14 @@ SubTransGetTopmostTransaction(TransactionId xid)
Size
SUBTRANSShmemSize(void)
{
- return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
+ return SimpleLruShmemSize(subtrans_buffers, 0);
}
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
- SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
+ SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
SubtransSLRULock, "pg_subtrans",
LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 38ddae08b8..4bdbbe5cc0 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -117,7 +117,7 @@
* frontend during startup.) The above design guarantees that notifies from
* other backends will never be missed by ignoring self-notifies.
*
- * The amount of shared memory used for notify management (NUM_NOTIFY_BUFFERS)
+ * The amount of shared memory used for notify management (notify_buffers)
* can be varied without affecting anything but performance. The maximum
* amount of notification data that can be queued at one time is determined
* by slru.c's wraparound limit; see QUEUE_MAX_PAGE below.
@@ -235,7 +235,7 @@ typedef struct QueuePosition
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
- * should likely be less than NUM_NOTIFY_BUFFERS, to ensure that backends
+ * should likely be less than notify_buffers, to ensure that backends
* catch up before the pages they'll need to read fall out of SLRU cache.
*/
#define QUEUE_CLEANUP_DELAY 4
@@ -521,7 +521,7 @@ AsyncShmemSize(void)
size = mul_size(MaxBackends + 1, sizeof(QueueBackendStatus));
size = add_size(size, offsetof(AsyncQueueControl, backend));
- size = add_size(size, SimpleLruShmemSize(NUM_NOTIFY_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
return size;
}
@@ -569,7 +569,7 @@ AsyncShmemInit(void)
* Set up SLRU management of the pg_notify data.
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
- SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
+ SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
SYNC_HANDLER_NONE);
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index a88cf5f118..c68d668514 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -18,6 +18,8 @@
#include <ctype.h>
+#include "access/clog.h"
+#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/xact.h"
@@ -400,6 +402,29 @@ show_timezone(void)
return "unknown";
}
+/*
+ * GUC show_hook for xact_buffers
+ */
+const char *
+show_xact_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CLOGShmemBuffers());
+ return nbuf;
+}
+
+/*
+ * GUC show_hook for commit_ts_buffers
+ */
+const char *
+show_commit_ts_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CommitTsShmemBuffers());
+ return nbuf;
+}
/*
* LOG_TIMEZONE
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a794546db3..18ea18316d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,7 +808,7 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
+ serial_buffers, 0, SerialSLRULock, "pg_serial",
LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
@@ -1347,7 +1347,7 @@ PredicateLockShmemSize(void)
/* Shared memory structures for SLRU tracking of old committed xids. */
size = add_size(size, sizeof(SerialControlData));
- size = add_size(size, SimpleLruShmemSize(NUM_SERIAL_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(serial_buffers, 0));
return size;
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 60bc1217fb..96d480325b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -156,3 +156,11 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int multixact_offsets_buffers = 64;
+int multixact_members_buffers = 64;
+int subtrans_buffers = 64;
+int notify_buffers = 64;
+int serial_buffers = 64;
+int xact_buffers = 0; /* 0 means compute from shared_buffers */
+int commit_ts_buffers = 0; /* 0 means compute from shared_buffers */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..c82635943b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
#include "access/xlog_internal.h"
@@ -2287,6 +2288,82 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact offset SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact member SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"subtrans_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the sub-transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &subtrans_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+ {
+ {"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the NOTIFY message SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &notify_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"serial_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the serializable transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &serial_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"xact_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the transaction status SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &xact_buffers,
+ 0, 0, CLOG_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_xact_buffers
+ },
+
+ {
+ {"commit_ts_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &commit_ts_buffers,
+ 0, 0, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_commit_ts_buffers
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..364553a314 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -50,6 +50,15 @@
#external_pid_file = '' # write an extra PID file
# (change requires restart)
+# - SLRU Buffers (change requires restart) -
+
+#xact_buffers = 0 # memory for pg_xact (0 = auto)
+#subtrans_buffers = 64 # memory for pg_subtrans
+#multixact_offsets_buffers = 64 # memory for pg_multixact/offsets
+#multixact_members_buffers = 64 # memory for pg_multixact/members
+#notify_buffers = 64 # memory for pg_notify
+#serial_buffers = 64 # memory for pg_serial
+#commit_ts_buffers = 0 # memory for pg_commit_ts (0 = auto)
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index d99444f073..a9cd65db36 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -15,6 +15,16 @@
#include "storage/sync.h"
#include "lib/stringinfo.h"
+/*
+ * Don't allow xact_buffers to be set higher than could possibly be useful or
+ * SLRU would allow.
+ */
+#define CLOG_XACTS_PER_BYTE 4
+#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
+#define CLOG_MAX_ALLOWED_BUFFERS \
+ Min(SLRU_MAX_ALLOWED_BUFFERS, \
+ (((MaxTransactionId / 2) + (CLOG_XACTS_PER_PAGE - 1)) / CLOG_XACTS_PER_PAGE))
+
/*
* Possible transaction statuses --- note that all-zeroes is the initial
* state.
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 0be1355892..18d7ba4ca9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -29,10 +29,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 552cc19e68..c0d37e3eb3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
/*
* Define SLRU segment size. A page is the same BLCKSZ as is used everywhere
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..147dc4acc3 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -11,9 +11,6 @@
#ifndef SUBTRANS_H
#define SUBTRANS_H
-/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
-
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
extern TransactionId SubTransGetTopmostTransaction(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b3e6815ee4 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -15,11 +15,6 @@
#include <signal.h>
-/*
- * The number of SLRU page buffers we use for the notification queue.
- */
-#define NUM_NOTIFY_BUFFERS 8
-
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..e2473f41de 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,13 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_buffers;
+extern PGDLLIMPORT int multixact_members_buffers;
+extern PGDLLIMPORT int subtrans_buffers;
+extern PGDLLIMPORT int notify_buffers;
+extern PGDLLIMPORT int serial_buffers;
+extern PGDLLIMPORT int xact_buffers;
+extern PGDLLIMPORT int commit_ts_buffers;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..7b68c8f1c7 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -26,10 +26,6 @@ extern PGDLLIMPORT int max_predicate_locks_per_xact;
extern PGDLLIMPORT int max_predicate_locks_per_relation;
extern PGDLLIMPORT int max_predicate_locks_per_page;
-
-/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
-
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
* in a parallel query.
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2a191830a8..8597e430de 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -161,4 +161,6 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern bool check_wal_segment_size(int *newval, void **extra, GucSource source);
extern void assign_wal_sync_method(int new_wal_sync_method, void *extra);
+extern const char *show_xact_buffers(void);
+extern const char *show_commit_ts_buffers(void);
#endif /* GUC_HOOKS_H */
--
2.39.2 (Apple Git-143)
IMO the whole area of SLRU buffering is in horrible shape and many users
are struggling with overall PG performance because of it. An
improvement doesn't have to be perfect -- it just has to be much better
than the current situation, which should be easy enough. We can
continue to improve later, using more scalable algorithms or ones that
allow us to raise the limits higher.
The only point on which we do not have full consensus yet is the need to
have one GUC per SLRU, and a lot of effort seems focused on trying to
fix the problem without adding so many GUCs (for example, using shared
buffers instead, or using a single "scaling" GUC). I think that hinders
progress. Let's just add multiple GUCs, and users can leave most of
them alone and only adjust the one with which they have a performance
problem; it's not going to be the same one for everybody.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Here is the updated version of the patch, here I have taken the
approach suggested by Andrey and I discussed the same with Alvaro
offlist and he also agrees with it. So the idea is that we will keep
the bank size fixed which is 16 buffers per bank and the allowed GUC
value for each slru buffer must be in multiple of the bank size. We
have removed the centralized lock but instead of one lock per bank, we
have kept the maximum limit on the number of bank locks which is 128.
We kept the max limit as 128 because, in one of the operations (i.e.
ActivateCommitTs), we need to acquire all the bank locks (but this is
not a performance path at all) and at a time we can acquire a max of
200 LWlocks, so we think this limit of 128 is good. So now if the
number of banks is <= 128 then we will be using one lock per bank;
otherwise one lock may protect access to buffers in multiple banks.
Just so I understand, I guess this means that an SLRU is limited to
16*128 = 2k buffers = 16MB?
When we were talking about this earlier, I suggested fixing the number
of banks and allowing the number of buffers per bank to scale
depending on the setting. That seemed simpler than allowing both the
number of banks and the number of buffers to vary, and it might allow
the compiler to optimize some code better, by converting a calculation
like page_no%number_of_banks into a masking operation like page_no&0xf
or whatever. However, because it allows an individual bank to become
arbitrarily large, it more or less requires us to use a buffer mapping
table. Some of the performance problems mentioned could be alleviated
by omitting the hash table when the number of buffers per bank is
small, and we could also create the dynahash with a custom hash
function that just does modular arithmetic on the page number rather
than a real hashing operation. However, maybe we don't really need to
do any of that. I agree that dynahash is clunky on a good day. I
hadn't realized the impact would be so noticeable.
This proposal takes the opposite approach of fixing the number of
buffers per bank, letting the number of banks vary. I think that's
probably fine, although it does reduce the effective associativity of
the cache. If there are more hot buffers in a bank than the bank size,
the bank will be contended, even if other banks are cold. However,
given the way SLRUs are accessed, it seems hard to imagine this being
a real problem in practice. There aren't likely to be say 20 hot
buffers that just so happen to all be separated from one another by a
number of pages that is a multiple of the configured number of banks.
And in the seemingly very unlikely event that you have a workload that
behaves like that, you could always adjust the number of banks up or
down by one, and the problem would go away. So this seems OK to me.
I also agree with a couple of points that Alvaro made, specifically
that (1) this doesn't have to be perfect, just better than now and (2)
separate GUCs for each SLRU is fine. On the latter point, it's worth
keeping in mind that the cost of a GUC that most people don't need to
tune is fairly low. GUCs like work_mem and shared_buffers are
"expensive" because everybody more or less needs to understand what
they are and how to set them and getting the right value can tricky --
but a GUC like autovacuum_naptime is a lot cheaper, because almost
nobody needs to change it. It seems to me that these GUCs will fall
into the latter category. Users can hopefully just ignore them except
if they see a contention on the SLRU bank locks -- and then they can
consider increasing the number of banks for that particular SLRU. That
seems simple enough. As with autovacuum_naptime, there is a danger
that people will configure a ridiculous value of the parameter for no
good reason and get bad results, so it would be nice if someday we had
a magical system that just got all of this right without the user
needing to configure anything. But in the meantime, it's better to
have a somewhat manual system to relieve pressure on these locks than
no system at all.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Here is the updated version of the patch, here I have taken the
approach suggested by Andrey and I discussed the same with Alvaro
offlist and he also agrees with it. So the idea is that we will keep
the bank size fixed which is 16 buffers per bank and the allowed GUC
value for each slru buffer must be in multiple of the bank size. We
have removed the centralized lock but instead of one lock per bank, we
have kept the maximum limit on the number of bank locks which is 128.
We kept the max limit as 128 because, in one of the operations (i.e.
ActivateCommitTs), we need to acquire all the bank locks (but this is
not a performance path at all) and at a time we can acquire a max of
200 LWlocks, so we think this limit of 128 is good. So now if the
number of banks is <= 128 then we will be using one lock per bank
otherwise the one lock may protect access of buffer in multiple banks.Just so I understand, I guess this means that an SLRU is limited to
16*128 = 2k buffers = 16MB?
Not really, because 128 is the maximum limit on the number of bank
locks, not on the number of banks. So, for example, if you have 16*128
= 2k buffers then each lock will protect one bank, and likewise when
you have 16 * 512 = 8k buffers then each lock will protect 4 banks.
In short, we can get the lock for each bank by a simple computation
(banklockno = bankno % 128).
When we were talking about this earlier, I suggested fixing the number
of banks and allowing the number of buffers per bank to scale
depending on the setting. That seemed simpler than allowing both the
number of banks and the number of buffers to vary, and it might allow
the compiler to optimize some code better, by converting a calculation
like page_no%number_of_banks into a masking operation like page_no&0xf
or whatever. However, because it allows an individual bank to become
arbitrarily large, it more or less requires us to use a buffer mapping
table. Some of the performance problems mentioned could be alleviated
by omitting the hash table when the number of buffers per bank is
small, and we could also create the dynahash with a custom hash
function that just does modular arithmetic on the page number rather
than a real hashing operation. However, maybe we don't really need to
do any of that. I agree that dynahash is clunky on a good day. I
hadn't realized the impact would be so noticeable.
Yes, so one idea is that we keep the number of banks fixed, and with
that, as you correctly pointed out, with a large number of buffers the
bank size can become quite big, and for that we would need a hash
table. OTOH, what I am doing here is keeping the bank size fixed and
small (16 buffers per bank), and with that we can have a large number
of banks when the buffer pool is quite big. But I feel having more
banks is not really a problem, as long as we do not grow the number of
locks beyond a certain limit, since in some corner cases we need to
acquire all the locks together and there is a limit on that. So I like
this idea of sharing locks across the banks, because 1) we can have
enough locks that lock contention or cache invalidation due to a
common lock should no longer be a problem, 2) we can keep the bank
size small so that the sequential search within a bank is quite fast,
making reads fast, and 3) with a small bank size the victim buffer
search, which has to be sequential, is also quite fast.
This proposal takes the opposite approach of fixing the number of
buffers per bank, letting the number of banks vary. I think that's
probably fine, although it does reduce the effective associativity of
the cache. If there are more hot buffers in a bank than the bank size,
the bank will be contended, even if other banks are cold. However,
given the way SLRUs are accessed, it seems hard to imagine this being
a real problem in practice. There aren't likely to be, say, 20 hot
buffers that just so happen to all be separated from one another by a
number of pages that is a multiple of the configured number of banks.
And in the seemingly very unlikely event that you have a workload that
behaves like that, you could always adjust the number of banks up or
down by one, and the problem would go away. So this seems OK to me.
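To make the pathological pattern concrete, here is a toy version of the page-to-bank mapping (illustrative only, not the patch's exact code): pages separated by a multiple of the bank count all collide in one bank, and nudging the bank count by one spreads them out.

```c
#include <assert.h>

/* Illustrative bank mapping: pageno modulo the configured number of banks. */
static int bank_of(int pageno, int nbanks)
{
    return pageno % nbanks;
}
```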
I agree with this.
I also agree with a couple of points that Alvaro made, specifically
that (1) this doesn't have to be perfect, just better than now and (2)
separate GUCs for each SLRU is fine. On the latter point, it's worth
keeping in mind that the cost of a GUC that most people don't need to
tune is fairly low. GUCs like work_mem and shared_buffers are
"expensive" because everybody more or less needs to understand what
they are and how to set them and getting the right value can be tricky --
but a GUC like autovacuum_naptime is a lot cheaper, because almost
nobody needs to change it. It seems to me that these GUCs will fall
into the latter category. Users can hopefully just ignore them except
if they see contention on the SLRU bank locks -- and then they can
consider increasing the number of banks for that particular SLRU. That
seems simple enough. As with autovacuum_naptime, there is a danger
that people will configure a ridiculous value of the parameter for no
good reason and get bad results, so it would be nice if someday we had
a magical system that just got all of this right without the user
needing to configure anything. But in the meantime, it's better to
have a somewhat manual system to relieve pressure on these locks than
no system at all.
+1
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
IMO the whole area of SLRU buffering is in horrible shape and many users
are struggling with overall PG performance because of it. An
improvement doesn't have to be perfect -- it just has to be much better
than the current situation, which should be easy enough. We can
continue to improve later, using more scalable algorithms or ones that
allow us to raise the limits higher.
I agree with this.
The only point on which we do not have full consensus yet is the need to
have one GUC per SLRU, and a lot of effort seems focused on trying to
fix the problem without adding so many GUCs (for example, using shared
buffers instead, or use a single "scaling" GUC). I think that hinders
progress. Let's just add multiple GUCs, and users can leave most of
them alone and only adjust the one with which they have a performance
problem; it's not going to be the same one for everybody.
+1
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote:
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
The only point on which we do not have full consensus yet is the need to
have one GUC per SLRU, and a lot of effort seems focused on trying to
fix the problem without adding so many GUCs (for example, using shared
buffers instead, or use a single "scaling" GUC). I think that hinders
progress. Let's just add multiple GUCs, and users can leave most of
them alone and only adjust the one with which they have a performance
problem; it's not going to be the same one for everybody.
+1

+1
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS. How do you know that these changes are safe?
0001 contains one typo in the docs, "cotents".
I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
leaving others in clog.c. Maybe add commentary cross-linking both.
Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
the slru.h-defined limit of 131072 is not that bad, even if it's more
than could possibly be needed for xact_buffers; nobody is going to use
64k buffers, since useful values are below a couple thousand anyhow.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes... Fixed.
/messages/by-id/482D1632.8010507@sigaev.ru
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS. How do you know that these changes are safe?
IMHO this is safe, as well as logical to do w.r.t. performance. It's
safe because whenever we are updating any page in a group we are
acquiring the respective bank lock in exclusive mode, and in the
extreme case where the group contains pages from different banks, we
switch the lock before updating pages from a different bank. And we
do not wake any process in a group until we have done the status
update for all the processes, so there cannot be any race condition
either. It should not affect performance adversely, and it does not
remove the need for group updates. The main use case of a group
update is to optimize the situation where most of the processes are
contending for status updates on the same page; processes waiting for
status updates on different pages will go to different groups w.r.t.
that page. In short, on a best-effort basis, we try to have a group
contain the processes waiting to update the same clog page, which
means logically all the processes in the group will be waiting on the
same bank lock. In the extreme situation where processes in the group
are trying to update different pages, or even pages from different
banks, we handle it by switching the lock. Someone may raise the
concern that, when there are processes waiting for different bank
locks, we could wake those processes up after releasing one lock; I
think that is not required, because that is exactly the situation we
are trying to avoid (processes trying to update different pages ending
up in the same group), so there is no point in adding complexity to
optimize that case.
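A simplified, hypothetical sketch of the lock-switching behavior described above: the real code in TransactionGroupUpdateXidStatus uses LWLockAcquire/LWLockRelease on the SLRU bank locks, while this toy version just records which bank lock is held and counts lock changes.

```c
#include <assert.h>

#define NBANKS 8   /* illustrative bank count */

static int lock_switches;   /* how many times we changed bank locks */
static int held_bank = -1;  /* bank whose lock we currently "hold" */

static void acquire_bank_lock(int bank)
{
    held_bank = bank;
    lock_switches++;
}

static void release_bank_lock(void)
{
    held_bank = -1;
}

/*
 * Update the status of each page in the group, switching bank locks only
 * when a member's page falls into a different bank than the lock we hold.
 */
static void group_update(const int *pages, int n)
{
    for (int i = 0; i < n; i++)
    {
        int bank = pages[i] % NBANKS;

        if (bank != held_bank)
        {
            if (held_bank != -1)
                release_bank_lock();
            acquire_bank_lock(bank);
        }
        /* ... update clog status on pages[i] under the bank lock ... */
    }
    release_bank_lock();
}
```

In the common case all members target the same page, so the lock is acquired exactly once; only the corner case of mixed banks pays for extra switches.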
0001 contains one typo in the docs, "cotents".
I'm not a fan of the fact that some CLOG sizing macros moved to clog.h,
leaving others in clog.c. Maybe add commentary cross-linking both.
Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to
the slru.h-defined limit of 131072 is not that bad, even if it's more
than could possibly be needed for xact_buffers; nobody is going to use
64k buffers, since useful values are below a couple thousand anyhow.
I agree that allowing xact_buffers to grow beyond 65536 up to the
slru.h-defined limit of 131072 is not that bad, so I will change that
in the next version.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
PFA, updated patch version, this fixes the comment given by Alvaro and
also improves some of the comments.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v6-0002-Divide-SLRU-buffers-into-banks.patchapplication/octet-stream; name=v6-0002-Divide-SLRU-buffers-into-banks.patchDownload
From dd32d90d3a6563bba258ee78fe3e3a5c1a413ede Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 17 Nov 2023 10:24:41 +0530
Subject: [PATCH v6 2/3] Divide SLRU buffers into banks
As we have made the SLRU buffer pool configurable, we want to
eliminate the linear search within the whole SLRU buffer pool. To
do so, we divide the SLRU buffers into banks. Each bank holds 16
buffers. Each SLRU pageno may reside in only one bank.
Adjacent pagenos reside in different banks. Along with this,
also ensure that the number of SLRU buffers is given in
multiples of the bank size.
Andrey M. Borodin and Dilip Kumar, based on feedback by Alvaro Herrera
---
src/backend/access/transam/clog.c | 10 ++++++
src/backend/access/transam/commit_ts.c | 10 ++++++
src/backend/access/transam/multixact.c | 19 +++++++++++
src/backend/access/transam/slru.c | 45 ++++++++++++++++++++++----
src/backend/access/transam/subtrans.c | 10 ++++++
src/backend/commands/async.c | 10 ++++++
src/backend/storage/lmgr/predicate.c | 10 ++++++
src/backend/utils/misc/guc_tables.c | 14 ++++----
src/include/access/slru.h | 12 ++++++-
src/include/utils/guc_hooks.h | 11 +++++++
10 files changed, 137 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8237b40aa6..44008222da 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -43,6 +43,7 @@
#include "pgstat.h"
#include "storage/proc.h"
#include "storage/sync.h"
+#include "utils/guc_hooks.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -1019,3 +1020,12 @@ clogsyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(XactCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for xact_buffers
+ */
+bool
+check_xact_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("xact_buffers", newval);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9ba5ae6534..96810959ab 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -33,6 +33,7 @@
#include "pg_trace.h"
#include "storage/shmem.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
@@ -1017,3 +1018,12 @@ committssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(CommitTsCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for commit_ts_buffers
+ */
+bool
+check_commit_ts_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("commit_ts_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 62709fcd07..77511c6342 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -88,6 +88,7 @@
#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
@@ -3419,3 +3420,21 @@ multixactmemberssyncfiletag(const FileTag *ftag, char *path)
{
return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
}
+
+/*
+ * GUC check_hook for multixact_offsets_buffers
+ */
+bool
+check_multixact_offsets_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_offsets_buffers", newval);
+}
+
+/*
+ * GUC check_hook for multixact_members_buffers
+ */
+bool
+check_multixact_members_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("multixact_members_buffers", newval);
+}
\ No newline at end of file
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9ed24e1185..b0d90a4bd2 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -59,6 +59,7 @@
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/shmem.h"
+#include "utils/guc_hooks.h"
#define SlruFileName(ctl, path, seg) \
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -134,7 +135,6 @@ typedef enum
static SlruErrorCause slru_errcause;
static int slru_errno;
-
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -258,7 +258,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
else
+ {
Assert(found);
+ Assert(shared->num_slots == nslots);
+ }
/*
* Initialize the unshared control struct, including directory path. We
@@ -266,6 +269,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
*/
ctl->shared = shared;
ctl->sync_handler = sync_handler;
+ ctl->bank_mask = (nslots / SLRU_BANK_SIZE) - 1;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -497,12 +501,18 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
SlruShared shared = ctl->shared;
int slotno;
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
/* Try to find the page while holding only shared lock */
LWLockAcquire(shared->ControlLock, LW_SHARED);
- /* See if page is already in a buffer */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ /*
+ * See if the page is already in a buffer pool. The buffer pool is
+ * divided into banks of buffers and each pageno may reside only in one
+ * bank so limit the search within the bank.
+ */
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
@@ -1029,9 +1039,15 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int bestinvalidslot = 0; /* keep compiler quiet */
int best_invalid_delta = -1;
int best_invalid_page_number = 0; /* keep compiler quiet */
+ int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankend = bankstart + SLRU_BANK_SIZE;
- /* See if page already has a buffer assigned */
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ /*
+ * See if the page is already in a buffer pool. The buffer pool is
+ * divided into banks of buffers and each pageno may reside only in one
+ * bank so limit the search within the bank.
+ */
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
@@ -1066,7 +1082,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
- for (slotno = 0; slotno < shared->num_slots; slotno++)
+ for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
int this_page_number;
@@ -1613,3 +1629,20 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
errno = save_errno;
return result;
}
+
+/*
+ * Helper function for GUC check_hook to check whether slru buffers are in
+ * multiples of SLRU_BANK_SIZE.
+ */
+bool
+check_slru_buffers(const char *name, int *newval)
+{
+ /* Valid values are multiples of SLRU_BANK_SIZE */
+ if (*newval % SLRU_BANK_SIZE == 0)
+ return true;
+
+ /* Value is not a multiple of the bank size */
+ GUC_check_errdetail("\"%s\" must be a multiple of %d", name,
+ SLRU_BANK_SIZE);
+ return false;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 0dd48f40f3..923e706535 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -33,6 +33,7 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "utils/guc_hooks.h"
#include "utils/snapmgr.h"
@@ -373,3 +374,12 @@ SubTransPagePrecedes(int page1, int page2)
return (TransactionIdPrecedes(xid1, xid2) &&
TransactionIdPrecedes(xid1, xid2 + SUBTRANS_XACTS_PER_PAGE - 1));
}
+
+/*
+ * GUC check_hook for subtrans_buffers
+ */
+bool
+check_subtrans_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("subtrans_buffers", newval);
+}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bdbbe5cc0..98449cbdde 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -149,6 +149,7 @@
#include "storage/sinval.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -2444,3 +2445,12 @@ ClearPendingActionsAndNotifies(void)
pendingActions = NULL;
pendingNotifies = NULL;
}
+
+/*
+ * GUC check_hook for notify_buffers
+ */
+bool
+check_notify_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("notify_buffers", newval);
+}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 18ea18316d..e4903c67ec 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -208,6 +208,7 @@
#include "storage/predicate_internals.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "utils/guc_hooks.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
@@ -5011,3 +5012,12 @@ AttachSerializableXact(SerializableXactHandle handle)
if (MySerializableXact != InvalidSerializableXact)
CreateLocalPredicateLockHash();
}
+
+/*
+ * GUC check_hook for serial_buffers
+ */
+bool
+check_serial_buffers(int *newval, void **extra, GucSource source)
+{
+ return check_slru_buffers("serial_buffers", newval);
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c1345dab98..8649b066a8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2296,7 +2296,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_offsets_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_offsets_buffers, NULL, NULL
},
{
@@ -2307,7 +2307,7 @@ struct config_int ConfigureNamesInt[] =
},
&multixact_members_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_multixact_members_buffers, NULL, NULL
},
{
@@ -2318,7 +2318,7 @@ struct config_int ConfigureNamesInt[] =
},
&subtrans_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_subtrans_buffers, NULL, NULL
},
{
{"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
@@ -2328,7 +2328,7 @@ struct config_int ConfigureNamesInt[] =
},
¬ify_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_notify_buffers, NULL, NULL
},
{
@@ -2339,7 +2339,7 @@ struct config_int ConfigureNamesInt[] =
},
&serial_buffers,
64, 16, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, NULL
+ check_serial_buffers, NULL, NULL
},
{
@@ -2350,7 +2350,7 @@ struct config_int ConfigureNamesInt[] =
},
&xact_buffers,
64, 0, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_xact_buffers
+ check_xact_buffers, NULL, show_xact_buffers
},
{
@@ -2361,7 +2361,7 @@ struct config_int ConfigureNamesInt[] =
},
&commit_ts_buffers,
64, 0, SLRU_MAX_ALLOWED_BUFFERS,
- NULL, NULL, show_commit_ts_buffers
+ check_commit_ts_buffers, NULL, show_commit_ts_buffers
},
{
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index c0d37e3eb3..51c5762b9f 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 16
+
/*
* To avoid overflowing internal arithmetic and the size_t data type, the
* number of buffers should not exceed this number.
@@ -139,6 +144,11 @@ typedef struct SlruCtlData
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
+
+ /*
+ * Mask for slotno banks
+ */
+ Size bank_mask;
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
@@ -175,5 +185,5 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
#endif /* SLRU_H */
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 7b95acf36e..0edd59f867 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -130,6 +130,17 @@ extern bool check_ssl(bool *newval, void **extra, GucSource source);
extern bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
extern bool check_synchronous_standby_names(char **newval, void **extra,
GucSource source);
+extern bool check_multixact_offsets_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_multixact_members_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_subtrans_buffers(int *newval, void **extra,
+ GucSource source);
+extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
+extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
+extern bool check_commit_ts_buffers(int *newval, void **extra,
+ GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
extern void assign_syslog_facility(int newval, void *extra);
--
2.39.2 (Apple Git-143)
v6-0001-Make-all-SLRU-buffer-sizes-configurable.patchapplication/octet-stream; name=v6-0001-Make-all-SLRU-buffer-sizes-configurable.patchDownload
From 37027c2a3560fc3a9c017cdb3a0b6501b85d9522 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 25 Oct 2023 14:45:00 +0530
Subject: [PATCH v6 1/3] Make all SLRU buffer sizes configurable.
Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.
Default sizes are also set to 64, since sizes much larger than the old
limits have been shown to be useful on modern systems.
Patch by Andrey M. Borodin and Dilip Kumar
Reviewed by Anastasia Lubennikova, Tomas Vondra, Alexander Korotkov,
Gilles Darold, Thomas Munro
---
doc/src/sgml/config.sgml | 135 ++++++++++++++++++
src/backend/access/transam/clog.c | 19 +--
src/backend/access/transam/commit_ts.c | 7 +-
src/backend/access/transam/multixact.c | 8 +-
src/backend/access/transam/subtrans.c | 5 +-
src/backend/commands/async.c | 8 +-
src/backend/commands/variable.c | 25 ++++
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/utils/init/globals.c | 8 ++
src/backend/utils/misc/guc_tables.c | 77 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 9 ++
src/include/access/multixact.h | 4 -
src/include/access/slru.h | 5 +
src/include/access/subtrans.h | 3 -
src/include/commands/async.h | 5 -
src/include/miscadmin.h | 7 +
src/include/storage/predicate.h | 4 -
src/include/utils/guc_hooks.h | 2 +
18 files changed, 293 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..693a0e6172 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,141 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-multixact-offsets-buffers" xreflabel="multixact_offsets_buffers">
+ <term><varname>multixact_offsets_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_offsets_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/offsets</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-multixact-members-buffers" xreflabel="multixact_members_buffers">
+ <term><varname>multixact_members_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>multixact_members_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_multixact/members</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-subtrans-buffers" xreflabel="subtrans_buffers">
+ <term><varname>subtrans_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>subtrans_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_subtrans</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-notify-buffers" xreflabel="notify_buffers">
+ <term><varname>notify_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_notify</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-serial-buffers" xreflabel="serial_buffers">
+ <term><varname>serial_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>serial_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_serial</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>64</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-xact-buffers" xreflabel="xact_buffers">
+ <term><varname>xact_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>xact_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents
+ of <literal>pg_xact</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 512, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-commit-ts-buffers" xreflabel="commit_ts_buffers">
+ <term><varname>commit_ts_buffers</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>commit_ts_buffers</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the amount of shared memory to use to cache the contents of
+ <literal>pg_commit_ts</literal> (see
+ <xref linkend="pgdata-contents-table"/>).
+ If this value is specified without units, it is taken as blocks,
+ that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
+ The default value is <literal>0</literal>, which requests
+ <varname>shared_buffers</varname> / 256, but not fewer than 16 blocks.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
<term><varname>max_stack_depth</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4a431d5876..8237b40aa6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -663,23 +663,16 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
/*
* Number of shared CLOG buffers.
*
- * On larger multi-processor systems, it is possible to have many CLOG page
- * requests in flight at one time which could lead to disk access for CLOG
- * page if the required page is not found in memory. Testing revealed that we
- * can get the best performance by having 128 CLOG buffers, more than that it
- * doesn't improve performance.
- *
- * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
- * a good idea, because it would increase the minimum amount of shared memory
- * required to start, which could be a problem for people running very small
- * configurations. The following formula seems to represent a reasonable
- * compromise: people with very low values for shared_buffers will get fewer
- * CLOG buffers as well, and everyone else will get 128.
+ * By default, we'll use 2MB of memory for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CLOGShmemBuffers(void)
{
- return Min(128, Max(4, NBuffers / 512));
+ /* Use configured value if provided. */
+ if (xact_buffers > 0)
+ return Max(16, xact_buffers);
+ return Min(SLRU_MAX_ALLOWED_BUFFERS, Max(16, NBuffers / 512));
}
/*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b897fabc70..9ba5ae6534 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -493,11 +493,16 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
* We use a very similar logic as for the number of CLOG buffers (except we
* scale up twice as fast with shared buffers, and the maximum is twice as
* high); see comments in CLOGShmemBuffers.
+ * By default, we'll use 4MB of memory for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
*/
Size
CommitTsShmemBuffers(void)
{
- return Min(256, Max(4, NBuffers / 256));
+ /* Use configured value if provided. */
+ if (commit_ts_buffers > 0)
+ return Max(16, commit_ts_buffers);
+ return Min(256, Max(16, NBuffers / 256));
}
/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 57ed34c0a8..62709fcd07 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1834,8 +1834,8 @@ MultiXactShmemSize(void)
mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
size = SHARED_MULTIXACT_STATE_SIZE;
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTOFFSET_BUFFERS, 0));
- size = add_size(size, SimpleLruShmemSize(NUM_MULTIXACTMEMBER_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_offsets_buffers, 0));
+ size = add_size(size, SimpleLruShmemSize(multixact_members_buffers, 0));
return size;
}
@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember", multixact_members_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 62bb610167..0dd48f40f3 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -31,6 +31,7 @@
#include "access/slru.h"
#include "access/subtrans.h"
#include "access/transam.h"
+#include "miscadmin.h"
#include "pg_trace.h"
#include "utils/snapmgr.h"
@@ -184,14 +185,14 @@ SubTransGetTopmostTransaction(TransactionId xid)
Size
SUBTRANSShmemSize(void)
{
- return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
+ return SimpleLruShmemSize(subtrans_buffers, 0);
}
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
- SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
+ SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
SubtransSLRULock, "pg_subtrans",
LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 38ddae08b8..4bdbbe5cc0 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -117,7 +117,7 @@
* frontend during startup.) The above design guarantees that notifies from
* other backends will never be missed by ignoring self-notifies.
*
- * The amount of shared memory used for notify management (NUM_NOTIFY_BUFFERS)
+ * The amount of shared memory used for notify management (notify_buffers)
* can be varied without affecting anything but performance. The maximum
* amount of notification data that can be queued at one time is determined
* by slru.c's wraparound limit; see QUEUE_MAX_PAGE below.
@@ -235,7 +235,7 @@ typedef struct QueuePosition
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
- * should likely be less than NUM_NOTIFY_BUFFERS, to ensure that backends
+ * should likely be less than notify_buffers, to ensure that backends
* catch up before the pages they'll need to read fall out of SLRU cache.
*/
#define QUEUE_CLEANUP_DELAY 4
@@ -521,7 +521,7 @@ AsyncShmemSize(void)
size = mul_size(MaxBackends + 1, sizeof(QueueBackendStatus));
size = add_size(size, offsetof(AsyncQueueControl, backend));
- size = add_size(size, SimpleLruShmemSize(NUM_NOTIFY_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
return size;
}
@@ -569,7 +569,7 @@ AsyncShmemInit(void)
* Set up SLRU management of the pg_notify data.
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
- SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
+ SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
SYNC_HANDLER_NONE);
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index a88cf5f118..c68d668514 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -18,6 +18,8 @@
#include <ctype.h>
+#include "access/clog.h"
+#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/xact.h"
@@ -400,6 +402,29 @@ show_timezone(void)
return "unknown";
}
+/*
+ * GUC show_hook for xact_buffers
+ */
+const char *
+show_xact_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CLOGShmemBuffers());
+ return nbuf;
+}
+
+/*
+ * GUC show_hook for commit_ts_buffers
+ */
+const char *
+show_commit_ts_buffers(void)
+{
+ static char nbuf[16];
+
+ snprintf(nbuf, sizeof(nbuf), "%zu", CommitTsShmemBuffers());
+ return nbuf;
+}
/*
* LOG_TIMEZONE
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a794546db3..18ea18316d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -808,7 +808,7 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
+ serial_buffers, 0, SerialSLRULock, "pg_serial",
LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
@@ -1347,7 +1347,7 @@ PredicateLockShmemSize(void)
/* Shared memory structures for SLRU tracking of old committed xids. */
size = add_size(size, sizeof(SerialControlData));
- size = add_size(size, SimpleLruShmemSize(NUM_SERIAL_BUFFERS, 0));
+ size = add_size(size, SimpleLruShmemSize(serial_buffers, 0));
return size;
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 60bc1217fb..96d480325b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -156,3 +156,11 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
+
+int multixact_offsets_buffers = 64;
+int multixact_members_buffers = 64;
+int subtrans_buffers = 64;
+int notify_buffers = 64;
+int serial_buffers = 64;
+int xact_buffers = 64;
+int commit_ts_buffers = 64;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..c1345dab98 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
#include "access/xlog_internal.h"
@@ -2287,6 +2288,82 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact offset SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_offsets_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"multixact_members_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the MultiXact member SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &multixact_members_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"subtrans_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the sub-transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &subtrans_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+ {
+ {"notify_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the NOTIFY message SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ ¬ify_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"serial_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the serializable transaction SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &serial_buffers,
+ 64, 16, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"xact_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the number of shared memory buffers used for the transaction status SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &xact_buffers,
+ 64, 0, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_xact_buffers
+ },
+
+ {
+ {"commit_ts_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp SLRU cache."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &commit_ts_buffers,
+ 64, 0, SLRU_MAX_ALLOWED_BUFFERS,
+ NULL, NULL, show_commit_ts_buffers
+ },
+
{
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..364553a314 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -50,6 +50,15 @@
#external_pid_file = '' # write an extra PID file
# (change requires restart)
+# - SLRU Buffers (change requires restart) -
+
+#xact_buffers = 0 # memory for pg_xact (0 = auto)
+#subtrans_buffers = 64 # memory for pg_subtrans
+#multixact_offsets_buffers = 64 # memory for pg_multixact/offsets
+#multixact_members_buffers = 64 # memory for pg_multixact/members
+#notify_buffers = 64 # memory for pg_notify
+#serial_buffers = 64 # memory for pg_serial
+#commit_ts_buffers = 0 # memory for pg_commit_ts (0 = auto)
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 0be1355892..18d7ba4ca9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -29,10 +29,6 @@
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
-/* Number of SLRU buffers to use for multixact */
-#define NUM_MULTIXACTOFFSET_BUFFERS 8
-#define NUM_MULTIXACTMEMBER_BUFFERS 16
-
/*
* Possible multixact lock modes ("status"). The first four modes are for
* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 552cc19e68..c0d37e3eb3 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -17,6 +17,11 @@
#include "storage/lwlock.h"
#include "storage/sync.h"
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
/*
* Define SLRU segment size. A page is the same BLCKSZ as is used everywhere
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
index 46a473c77f..147dc4acc3 100644
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@ -11,9 +11,6 @@
#ifndef SUBTRANS_H
#define SUBTRANS_H
-/* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS 32
-
extern void SubTransSetParent(TransactionId xid, TransactionId parent);
extern TransactionId SubTransGetParent(TransactionId xid);
extern TransactionId SubTransGetTopmostTransaction(TransactionId xid);
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index 02da6ba7e1..b3e6815ee4 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -15,11 +15,6 @@
#include <signal.h>
-/*
- * The number of SLRU page buffers we use for the notification queue.
- */
-#define NUM_NOTIFY_BUFFERS 8
-
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..e2473f41de 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -177,6 +177,13 @@ extern PGDLLIMPORT int MaxBackends;
extern PGDLLIMPORT int MaxConnections;
extern PGDLLIMPORT int max_worker_processes;
extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_buffers;
+extern PGDLLIMPORT int multixact_members_buffers;
+extern PGDLLIMPORT int subtrans_buffers;
+extern PGDLLIMPORT int notify_buffers;
+extern PGDLLIMPORT int serial_buffers;
+extern PGDLLIMPORT int xact_buffers;
+extern PGDLLIMPORT int commit_ts_buffers;
extern PGDLLIMPORT int MyProcPid;
extern PGDLLIMPORT pg_time_t MyStartTime;
diff --git a/src/include/storage/predicate.h b/src/include/storage/predicate.h
index cd48afa17b..7b68c8f1c7 100644
--- a/src/include/storage/predicate.h
+++ b/src/include/storage/predicate.h
@@ -26,10 +26,6 @@ extern PGDLLIMPORT int max_predicate_locks_per_xact;
extern PGDLLIMPORT int max_predicate_locks_per_relation;
extern PGDLLIMPORT int max_predicate_locks_per_page;
-
-/* Number of SLRU buffers to use for Serial SLRU */
-#define NUM_SERIAL_BUFFERS 16
-
/*
* A handle used for sharing SERIALIZABLEXACT objects between the participants
* in a parallel query.
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 3d74483f44..7b95acf36e 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -163,4 +163,6 @@ extern void assign_wal_consistency_checking(const char *newval, void *extra);
extern bool check_wal_segment_size(int *newval, void **extra, GucSource source);
extern void assign_wal_sync_method(int new_wal_sync_method, void *extra);
+extern const char *show_xact_buffers(void);
+extern const char *show_commit_ts_buffers(void);
#endif /* GUC_HOOKS_H */
--
2.39.2 (Apple Git-143)
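The auto-sizing formulas in the hunks above can be sketched standalone (a minimal illustration assuming the default BLCKSZ of 8192; the `*Sketch` names are hypothetical, not the patch's actual identifiers):

```c
#include <assert.h>

/* Standalone sketch of the buffer auto-sizing logic shown in the patch
 * above, assuming the default BLCKSZ of 8192. The *Sketch names are
 * hypothetical; they mirror CLOGShmemBuffers/CommitTsShmemBuffers. */
#define BLCKSZ 8192
#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

/* xact_buffers auto mode: scale with shared buffers, at least 16,
 * capped at what slru.c allows. */
int
CLOGShmemBuffersSketch(int NBuffers)
{
	return Min(SLRU_MAX_ALLOWED_BUFFERS, Max(16, NBuffers / 512));
}

/* commit_ts_buffers: an explicitly configured value wins (floored at
 * 16), otherwise scale twice as fast as CLOG but cap at 256. */
int
CommitTsShmemBuffersSketch(int NBuffers, int commit_ts_buffers)
{
	if (commit_ts_buffers > 0)
		return Max(16, commit_ts_buffers);
	return Min(256, Max(16, NBuffers / 256));
}
```

With 1GB of shared buffers (NBuffers = 131072 pages of 8kB), both formulas land on 256 buffers, i.e. 2MB per SLRU.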
Attachment: v6-0003-Remove-the-centralized-control-lock-and-LRU-count.patch
From ab0493dee5c682aa0e8d22075b88fd2ca8fb0bfe Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 17 Nov 2023 14:42:25 +0530
Subject: [PATCH v6 3/3] Remove the centralized control lock and LRU counter
The previous patch divided the SLRU buffer pool into associative
banks. This patch optimizes it further by replacing the single
centralized control lock with multiple bank-wise SLRU locks, which
reduces contention on the SLRU control lock. We use at most 128 bank
locks: if the number of banks is <= 128, each lock covers exactly one
bank; otherwise each lock covers multiple banks, with the bank-to-lock
mapping computed as (bankno % 128). This patch also replaces the
centralized LRU counter with bank-wise counters, avoiding the frequent
cache invalidation caused by updating a single shared variable.
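The bank-to-lock mapping described above can be sketched as follows (a minimal illustration only; the identifiers here are hypothetical, not the patch's actual code):

```c
#include <assert.h>

/* Minimal sketch of the bank-to-lock mapping described above: with at
 * most 128 bank locks, banks beyond the first 128 share locks via a
 * simple modulo. Names here are hypothetical. */
#define SLRU_MAX_BANKLOCKS 128

int
BankToLockIndex(int bankno)
{
	return bankno % SLRU_MAX_BANKLOCKS;
}
```

So with 128 or fewer banks the mapping is one-to-one, and beyond that each lock protects every 128th bank.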
Dilip Kumar based on design inputs from Robert Haas, Andrey M. Borodin,
and Alvaro Herrera
---
src/backend/access/transam/clog.c | 122 ++++++++----
src/backend/access/transam/commit_ts.c | 43 ++--
src/backend/access/transam/multixact.c | 175 ++++++++++++-----
src/backend/access/transam/slru.c | 238 +++++++++++++++++------
src/backend/access/transam/subtrans.c | 58 ++++--
src/backend/commands/async.c | 43 ++--
src/backend/storage/lmgr/lwlock.c | 14 ++
src/backend/storage/lmgr/lwlocknames.txt | 14 +-
src/backend/storage/lmgr/predicate.c | 33 ++--
src/include/access/slru.h | 63 ++++--
src/include/storage/lwlock.h | 7 +
src/test/modules/test_slru/test_slru.c | 32 +--
12 files changed, 594 insertions(+), 248 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 44008222da..a4fd16ec7f 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -275,15 +275,20 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
XLogRecPtr lsn, int pageno,
bool all_xact_same_page)
{
+ LWLock *lock;
+
/* Can't use group update when PGPROC overflows. */
StaticAssertDecl(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS,
"group clog threshold less than PGPROC cached subxids");
+ /* Get the SLRU bank lock w.r.t. the page we are going to access. */
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+
/*
- * When there is contention on XactSLRULock, we try to group multiple
+ * When there is contention on Xact SLRU lock, we try to group multiple
* updates; a single leader process will perform transaction status
- * updates for multiple backends so that the number of times XactSLRULock
- * needs to be acquired is reduced.
+ * updates for multiple backends so that the number of times the Xact SLRU
+ * lock needs to be acquired is reduced.
*
* For this optimization to be safe, the XID and subxids in MyProc must be
* the same as the ones for which we're setting the status. Check that
@@ -301,17 +306,17 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
nsubxids * sizeof(TransactionId)) == 0))
{
/*
- * If we can immediately acquire XactSLRULock, we update the status of
+ * If we can immediately acquire SLRU lock, we update the status of
* our own XID and release the lock. If not, try use group XID
* update. If that doesn't work out, fall back to waiting for the
* lock to perform an update for this transaction only.
*/
- if (LWLockConditionalAcquire(XactSLRULock, LW_EXCLUSIVE))
+ if (LWLockConditionalAcquire(lock, LW_EXCLUSIVE))
{
/* Got the lock without waiting! Do the update. */
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
return;
}
else if (TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
@@ -324,10 +329,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
}
/* Group update not applicable, or couldn't accept this page number. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status,
lsn, pageno);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -346,7 +351,8 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
Assert(status == TRANSACTION_STATUS_COMMITTED ||
status == TRANSACTION_STATUS_ABORTED ||
(status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
- Assert(LWLockHeldByMeInMode(XactSLRULock, LW_EXCLUSIVE));
+ Assert(LWLockHeldByMeInMode(SimpleLruGetSLRUBankLock(XactCtl, pageno),
+ LW_EXCLUSIVE));
/*
* If we're doing an async commit (ie, lsn is valid), then we must wait
@@ -397,14 +403,13 @@ TransactionIdSetPageStatusInternal(TransactionId xid, int nsubxids,
}
/*
- * When we cannot immediately acquire XactSLRULock in exclusive mode at
+ * When we cannot immediately acquire SLRU bank lock in exclusive mode at
* commit time, add ourselves to a list of processes that need their XIDs
* status update. The first process to add itself to the list will acquire
- * XactSLRULock in exclusive mode and set transaction status as required
- * on behalf of all group members. This avoids a great deal of contention
- * around XactSLRULock when many processes are trying to commit at once,
- * since the lock need not be repeatedly handed off from one committing
- * process to the next.
+ * the lock in exclusive mode and set transaction status as required on behalf
+ * of all group members. This avoids a great deal of contention when many
+ * processes are trying to commit at once, since the lock need not be
+ * repeatedly handed off from one committing process to the next.
*
* Returns true when transaction status has been updated in clog; returns
* false if we decided against applying the optimization because the page
@@ -418,6 +423,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
PGPROC *proc = MyProc;
uint32 nextidx;
uint32 wakeidx;
+ int prevpageno;
+ LWLock *prevlock = NULL;
/* We should definitely have an XID whose status needs to be updated. */
Assert(TransactionIdIsValid(xid));
@@ -498,13 +505,10 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
return true;
}
- /* We are the leader. Acquire the lock on behalf of everyone. */
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
- * Now that we've got the lock, clear the list of processes waiting for
- * group XID status update, saving a pointer to the head of the list.
- * Trying to pop elements one at a time could lead to an ABA problem.
+ * We are leader so clear the list of processes waiting for group XID
+ * status update, saving a pointer to the head of the list. Trying to pop
+ * elements one at a time could lead to an ABA problem.
*/
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
INVALID_PGPROCNO);
@@ -512,10 +516,44 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/* Remember head of list so we can perform wakeups after dropping lock. */
wakeidx = nextidx;
+ /*
+ * Acquire the SLRU bank lock for the first page in the group. If the
+ * group contains multiple pages that fall under different banks, we
+ * will release this lock and acquire the new bank's lock before
+ * accessing the new page. There is a rare possibility that a group
+ * contains more than one page (for details, see the comment in the
+ * while loop above) and that it is from a different bank, but we are
+ * safe because we release the old lock before taking the new one, so
+ * concurrent updaters locking in opposite orders cannot deadlock.
+ */
+ prevpageno = ProcGlobal->allProcs[nextidx].clogGroupMemberPage;
+ prevlock = SimpleLruGetSLRUBankLock(XactCtl, prevpageno);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
/* Walk the list and update the status of all XIDs. */
while (nextidx != INVALID_PGPROCNO)
{
PGPROC *nextproc = &ProcGlobal->allProcs[nextidx];
+ int thispageno = nextproc->clogGroupMemberPage;
+
+ /*
+ * If the SLRU bank lock for the current page is not the same as the
+ * one for the last page, release the lock on the previous bank and
+ * acquire the lock on the bank of the page we are going to update
+ * now.
+ */
+ if (thispageno != prevpageno)
+ {
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, thispageno);
+
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ prevlock = lock;
+ prevpageno = thispageno;
+ }
/*
* Transactions with more than THRESHOLD_SUBTRANS_CLOG_OPT sub-XIDs
@@ -535,7 +573,8 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
}
/* We're done with the lock now. */
- LWLockRelease(XactSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
/*
* Now that we've released the lock, go back and wake everybody up. We
@@ -564,10 +603,11 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
/*
* Sets the commit status of a single transaction.
*
- * Must be called with XactSLRULock held
+ * Must be called with the slot-specific SLRU bank lock held
*/
static void
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+ int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -656,7 +696,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
lsnindex = GetLSNIndex(slotno, xid);
*lsn = XactCtl->shared->group_lsn[lsnindex];
- LWLockRelease(XactSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(XactCtl, pageno));
return status;
}
@@ -690,8 +730,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
- SYNC_HANDLER_CLOG);
+ "pg_xact", LWTRANCHE_XACT_BUFFER,
+ LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG);
SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
}
@@ -705,8 +745,9 @@ void
BootStrapCLOG(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, 0);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the commit log */
slotno = ZeroCLOGPage(0, false);
@@ -715,7 +756,7 @@ BootStrapCLOG(void)
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -750,14 +791,10 @@ StartupCLOG(void)
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
-
/*
* Initialize our idea of the latest page number.
*/
- XactCtl->shared->latest_page_number = pageno;
-
- LWLockRelease(XactSLRULock);
+ pg_atomic_init_u32(&XactCtl->shared->latest_page_number, pageno);
}
/*
@@ -768,8 +805,9 @@ TrimCLOG(void)
{
TransactionId xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
int pageno = TransactionIdToPage(xid);
+ LWLock *lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* Zero out the remainder of the current clog page. Under normal
@@ -801,7 +839,7 @@ TrimCLOG(void)
XactCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -833,6 +871,7 @@ void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -843,13 +882,14 @@ ExtendCLOG(TransactionId newestXact)
return;
pageno = TransactionIdToPage(newestXact);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
@@ -987,16 +1027,18 @@ clog_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(XactCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
SimpleLruWritePage(XactCtl, slotno);
Assert(!XactCtl->shared->page_dirty[slotno]);
- LWLockRelease(XactSLRULock);
+ LWLockRelease(lock);
}
else if (info == CLOG_TRUNCATE)
{
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 96810959ab..ae1badd295 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -219,8 +219,9 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
{
int slotno;
int i;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(CommitTsCtl, pageno, true, xid);
@@ -230,13 +231,13 @@ SetXidCommitTsInPage(TransactionId xid, int nsubxids,
CommitTsCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
* Sets the commit timestamp of a single transaction.
*
- * Must be called with CommitTsSLRULock held
+ * Must be called with the slot-specific SLRU bank lock held
*/
static void
TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
@@ -337,7 +338,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
if (nodeid)
*nodeid = entry.nodeid;
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(CommitTsCtl, pageno));
return *ts != 0;
}
@@ -527,9 +528,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
- CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER,
- SYNC_HANDLER_COMMIT_TS);
+ "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
+ LWTRANCHE_COMMITTS_SLRU, SYNC_HANDLER_COMMIT_TS);
SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
commitTsShared = ShmemInitStruct("CommitTs shared",
@@ -685,9 +685,7 @@ ActivateCommitTs(void)
/*
* Re-Initialize our idea of the latest page number.
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
- CommitTsCtl->shared->latest_page_number = pageno;
- LWLockRelease(CommitTsSLRULock);
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number, pageno);
/*
* If CommitTs is enabled, but it wasn't in the previous server run, we
@@ -714,12 +712,13 @@ ActivateCommitTs(void)
if (!SimpleLruDoesPhysicalPageExist(CommitTsCtl, pageno))
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/* Change the activation status in shared memory. */
@@ -768,9 +767,9 @@ DeactivateCommitTs(void)
* be overwritten anyway when we wrap around, but it seems better to be
* tidy.)
*/
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ SimpleLruAcquireAllBankLock(CommitTsCtl, LW_EXCLUSIVE);
(void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
- LWLockRelease(CommitTsSLRULock);
+ SimpleLruReleaseAllBankLock(CommitTsCtl);
}
/*
@@ -802,6 +801,7 @@ void
ExtendCommitTs(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* Nothing to do if module not enabled. Note we do an unlocked read of
@@ -822,12 +822,14 @@ ExtendCommitTs(TransactionId newestXact)
pageno = TransactionIdToCTsPage(newestXact);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCommitTsPage(pageno, !InRecovery);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -981,16 +983,18 @@ commit_ts_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ lock = SimpleLruGetSLRUBankLock(CommitTsCtl, pageno);
- LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroCommitTsPage(pageno, false);
SimpleLruWritePage(CommitTsCtl, slotno);
Assert(!CommitTsCtl->shared->page_dirty[slotno]);
- LWLockRelease(CommitTsSLRULock);
+ LWLockRelease(lock);
}
else if (info == COMMIT_TS_TRUNCATE)
{
@@ -1002,7 +1006,8 @@ commit_ts_redo(XLogReaderState *record)
* During XLOG replay, latest_page_number isn't set up yet; insert a
* suitable value to bypass the sanity test in SimpleLruTruncate.
*/
- CommitTsCtl->shared->latest_page_number = trunc->pageno;
+ pg_atomic_write_u32(&CommitTsCtl->shared->latest_page_number,
+ trunc->pageno);
SimpleLruTruncate(CommitTsCtl, trunc->pageno);
}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 77511c6342..6aa72acf22 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -193,10 +193,10 @@ static SlruCtlData MultiXactMemberCtlData;
/*
* MultiXact state shared across all backends. All this state is protected
- * by MultiXactGenLock. (We also use MultiXactOffsetSLRULock and
- * MultiXactMemberSLRULock to guard accesses to the two sets of SLRU
- * buffers. For concurrency's sake, we avoid holding more than one of these
- * locks at a time.)
+ * by MultiXactGenLock. (We also use the bank-wise SLRU locks of
+ * MultiXactOffset and MultiXactMember to guard accesses to the two sets
+ * of SLRU buffers. For concurrency's sake, we avoid holding more than
+ * one of these locks at a time.)
*/
typedef struct MultiXactStateData
{
@@ -871,12 +871,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int slotno;
MultiXactOffset *offptr;
int i;
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLock *lock;
+ LWLock *prevlock = NULL;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -892,10 +895,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
- /* Exchange our lock */
- LWLockRelease(MultiXactOffsetSLRULock);
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ /* Release MultiXactOffset SLRU lock. */
+ LWLockRelease(lock);
prev_pageno = -1;
@@ -917,6 +918,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
if (pageno != prev_pageno)
{
+ /*
+ * The MultiXactMember SLRU page has changed, so check whether the
+ * new page falls into a different SLRU bank; if so, release the
+ * old bank's lock and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -937,7 +952,8 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
}
/*
@@ -1240,6 +1256,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactId tmpMXact;
MultiXactOffset nextOffset;
MultiXactMember *ptr;
+ LWLock *lock;
+ LWLock *prevlock = NULL;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1343,11 +1361,22 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* time on every multixact creation.
*/
retry:
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * If this page falls into a different bank, release the old bank's lock
+ * and acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock != NULL)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1380,7 +1409,21 @@ retry:
entryno = MultiXactIdToOffsetEntry(tmpMXact);
if (pageno != prev_pageno)
+ {
+ /*
+ * Since we're going to access a different SLRU page, if this page
+ * falls into a different bank, release the old bank's lock and
+ * acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+ }
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1389,7 +1432,8 @@ retry:
if (nextMXOffset == 0)
{
/* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
goto retry;
@@ -1398,13 +1442,11 @@ retry:
length = nextMXOffset - offset;
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(prevlock);
+ prevlock = NULL;
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
- /* Now get the members themselves. */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
-
truelength = 0;
prev_pageno = -1;
for (i = 0; i < length; i++, offset++)
@@ -1420,6 +1462,20 @@ retry:
if (pageno != prev_pageno)
{
+ /*
+ * Since we're going to access a different SLRU page, if this page
+ * falls into a different bank, release the old bank's lock and
+ * acquire the lock on the new bank.
+ */
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ if (lock != prevlock)
+ {
+ if (prevlock)
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
prev_pageno = pageno;
}
@@ -1443,7 +1499,8 @@ retry:
truelength++;
}
- LWLockRelease(MultiXactMemberSLRULock);
+ if (prevlock)
+ LWLockRelease(prevlock);
/* A multixid with zero members should not happen */
Assert(truelength > 0);
@@ -1853,14 +1910,14 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", multixact_offsets_buffers, 0,
- MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
SYNC_HANDLER_MULTIXACT_OFFSET);
SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", multixact_members_buffers, 0,
- MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
SYNC_HANDLER_MULTIXACT_MEMBER);
/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
@@ -1895,8 +1952,10 @@ void
BootStrapMultiXact(void)
{
int slotno;
+ LWLock *lock;
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the offsets log */
slotno = ZeroMultiXactOffsetPage(0, false);
@@ -1905,9 +1964,10 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, 0);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the members log */
slotno = ZeroMultiXactMemberPage(0, false);
@@ -1916,7 +1976,7 @@ BootStrapMultiXact(void)
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -1976,10 +2036,12 @@ static void
MaybeExtendOffsetSlru(void)
{
int pageno;
+ LWLock *lock;
pageno = MultiXactIdToOffsetPage(MultiXactState->nextMXact);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
{
@@ -1994,7 +2056,7 @@ MaybeExtendOffsetSlru(void)
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
}
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2016,13 +2078,15 @@ StartupMultiXact(void)
* Initialize offset's idea of the latest page number.
*/
pageno = MultiXactIdToOffsetPage(multi);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Initialize member's idea of the latest page number.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_init_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
}
/*
@@ -2047,13 +2111,13 @@ TrimMultiXact(void)
LWLockRelease(MultiXactGenLock);
/* Clean up offsets state */
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for offsets.
*/
pageno = MultiXactIdToOffsetPage(nextMXact);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current offsets page. See notes in
@@ -2068,7 +2132,9 @@ TrimMultiXact(void)
{
int slotno;
MultiXactOffset *offptr;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -2076,18 +2142,17 @@ TrimMultiXact(void)
MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactOffsetSLRULock);
-
/* And the same for members */
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
/*
* (Re-)Initialize our idea of the latest page number for members.
*/
pageno = MXOffsetToMemberPage(offset);
- MultiXactMemberCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactMemberCtl->shared->latest_page_number,
+ pageno);
/*
* Zero out the remainder of the current members page. See notes in
@@ -2099,7 +2164,9 @@ TrimMultiXact(void)
int slotno;
TransactionId *xidptr;
int memberoff;
+ LWLock *lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
memberoff = MXOffsetToMemberOffset(offset);
slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
xidptr = (TransactionId *)
@@ -2114,10 +2181,9 @@ TrimMultiXact(void)
*/
MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ LWLockRelease(lock);
}
- LWLockRelease(MultiXactMemberSLRULock);
-
/* signal that we're officially up */
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->finishedStartup = true;
@@ -2405,6 +2471,7 @@ static void
ExtendMultiXactOffset(MultiXactId multi)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first MultiXactId of a page. But beware: just after
@@ -2415,13 +2482,14 @@ ExtendMultiXactOffset(MultiXactId multi)
return;
pageno = MultiXactIdToOffsetPage(multi);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactOffsetPage(pageno, true);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2454,15 +2522,17 @@ ExtendMultiXactMember(MultiXactOffset offset, int nmembers)
if (flagsoff == 0 && flagsbit == 0)
{
int pageno;
+ LWLock *lock;
pageno = MXOffsetToMemberPage(offset);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroMultiXactMemberPage(pageno, true);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -2760,7 +2830,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
offset = *offptr;
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno));
*result = offset;
return true;
@@ -3242,31 +3312,33 @@ multixact_redo(XLogReaderState *record)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactOffsetPage(pageno, false);
SimpleLruWritePage(MultiXactOffsetCtl, slotno);
Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactOffsetSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
int pageno;
int slotno;
+ LWLock *lock;
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
-
- LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(MultiXactMemberCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = ZeroMultiXactMemberPage(pageno, false);
SimpleLruWritePage(MultiXactMemberCtl, slotno);
Assert(!MultiXactMemberCtl->shared->page_dirty[slotno]);
- LWLockRelease(MultiXactMemberSLRULock);
+ LWLockRelease(lock);
}
else if (info == XLOG_MULTIXACT_CREATE_ID)
{
@@ -3332,7 +3404,8 @@ multixact_redo(XLogReaderState *record)
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
- MultiXactOffsetCtl->shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&MultiXactOffsetCtl->shared->latest_page_number,
+ pageno);
PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index b0d90a4bd2..dfbe0fd5f4 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -72,6 +72,21 @@
*/
#define MAX_WRITEALL_BUFFERS 16
+/*
+ * Macro to get the index of the lock in SlruSharedData's bank_locks array
+ * that protects a given slotno.
+ *
+ * The SLRU buffer pool is divided into banks of buffers, and at most
+ * SLRU_MAX_BANKLOCKS locks protect access to the buffers in those banks.
+ * Since the number of locks is capped, we cannot always have one lock per
+ * bank: as long as the number of banks is <= SLRU_MAX_BANKLOCKS, each bank
+ * has its own lock; otherwise one lock may protect multiple banks.
+ */
+#define SLRU_SLOTNO_GET_BANKLOCKNO(slotno) \
+ (((slotno) / SLRU_BANK_SIZE) % SLRU_MAX_BANKLOCKS)
+
typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
@@ -93,34 +108,6 @@ typedef struct SlruWriteAllData *SlruWriteAll;
(a).segno = (xx_segno) \
)
-/*
- * Macro to mark a buffer slot "most recently used". Note multiple evaluation
- * of arguments!
- *
- * The reason for the if-test is that there are often many consecutive
- * accesses to the same page (particularly the latest page). By suppressing
- * useless increments of cur_lru_count, we reduce the probability that old
- * pages' counts will "wrap around" and make them appear recently used.
- *
- * We allow this code to be executed concurrently by multiple processes within
- * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
- * this should not cause any completely-bogus values to enter the computation.
- * However, it is possible for either cur_lru_count or individual
- * page_lru_count entries to be "reset" to lower values than they should have,
- * in case a process is delayed while it executes this macro. With care in
- * SlruSelectLRUPage(), this does little harm, and in any case the absolute
- * worst possible consequence is a nonoptimal choice of page to evict. The
- * gain from allowing concurrent reads of SLRU pages seems worth it.
- */
-#define SlruRecentlyUsed(shared, slotno) \
- do { \
- int new_lru_count = (shared)->cur_lru_count; \
- if (new_lru_count != (shared)->page_lru_count[slotno]) { \
- (shared)->cur_lru_count = ++new_lru_count; \
- (shared)->page_lru_count[slotno] = new_lru_count; \
- } \
- } while (0)
-
/* Saved info for SlruReportIOError */
typedef enum
{
@@ -147,6 +134,7 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
int segpage, void *data);
static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static inline void SlruRecentlyUsed(SlruShared shared, int slotno);
/*
* Initialization of shared memory
@@ -156,6 +144,8 @@ Size
SimpleLruShmemSize(int nslots, int nlsns)
{
Size sz;
+ int nbanks = nslots / SLRU_BANK_SIZE;
+ int nbanklocks = Min(nbanks, SLRU_MAX_BANKLOCKS);
/* we assume nslots isn't so large as to risk overflow */
sz = MAXALIGN(sizeof(SlruSharedData));
@@ -165,6 +155,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
sz += MAXALIGN(nslots * sizeof(int)); /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */
+ sz += MAXALIGN(nbanklocks * sizeof(LWLockPadded)); /* bank_locks[] */
+ sz += MAXALIGN(nbanks * sizeof(int)); /* bank_cur_lru_count[] */
if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
@@ -181,16 +173,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
* nlsns: number of LSN groups per page (set to zero if not relevant).
* ctllock: LWLock to use to control access to the shared control structure.
* subdir: PGDATA-relative subdirectory that will contain the files.
- * tranche_id: LWLock tranche ID to use for the SLRU's per-buffer LWLocks.
+ * buffer_tranche_id: tranche ID to use for the SLRU's per-buffer LWLocks.
+ * bank_tranche_id: tranche ID to use for the bank LWLocks.
* sync_handler: which set of functions to use to handle sync requests
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
+ const char *subdir, int buffer_tranche_id, int bank_tranche_id,
SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
+ int nbanks = nslots / SLRU_BANK_SIZE;
+ int nbanklocks = Min(nbanks, SLRU_MAX_BANKLOCKS);
shared = (SlruShared) ShmemInitStruct(name,
SimpleLruShmemSize(nslots, nlsns),
@@ -202,18 +197,16 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
char *ptr;
Size offset;
int slotno;
+ int bankno;
+ int banklockno;
Assert(!found);
memset(shared, 0, sizeof(SlruSharedData));
- shared->ControlLock = ctllock;
-
shared->num_slots = nslots;
shared->lsn_groups_per_page = nlsns;
- shared->cur_lru_count = 0;
-
/* shared->latest_page_number will be set later */
shared->slru_stats_idx = pgstat_get_slru_index(name);
@@ -234,6 +227,10 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
/* Initialize LWLocks */
shared->buffer_locks = (LWLockPadded *) (ptr + offset);
offset += MAXALIGN(nslots * sizeof(LWLockPadded));
+ shared->bank_locks = (LWLockPadded *) (ptr + offset);
+ offset += MAXALIGN(nbanklocks * sizeof(LWLockPadded));
+ shared->bank_cur_lru_count = (int *) (ptr + offset);
+ offset += MAXALIGN(nbanks * sizeof(int));
if (nlsns > 0)
{
@@ -245,7 +242,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
for (slotno = 0; slotno < nslots; slotno++)
{
LWLockInitialize(&shared->buffer_locks[slotno].lock,
- tranche_id);
+ buffer_tranche_id);
shared->page_buffer[slotno] = ptr;
shared->page_status[slotno] = SLRU_PAGE_EMPTY;
@@ -254,6 +251,15 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
ptr += BLCKSZ;
}
+ /* Initialize the bank locks. */
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockInitialize(&shared->bank_locks[banklockno].lock,
+ bank_tranche_id);
+
+ /* Initialize the bank lru counters. */
+ for (bankno = 0; bankno < nbanks; bankno++)
+ shared->bank_cur_lru_count[bankno] = 0;
+
/* Should fit to estimated shmem size */
Assert(ptr - (char *) shared <= SimpleLruShmemSize(nslots, nlsns));
}
@@ -307,7 +313,7 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
- shared->latest_page_number = pageno;
+ pg_atomic_write_u32(&shared->latest_page_number, pageno);
/* update the stats counter of zeroed pages */
pgstat_count_slru_page_zeroed(shared->slru_stats_idx);
@@ -346,12 +352,13 @@ static void
SimpleLruWaitIO(SlruCtl ctl, int slotno)
{
SlruShared shared = ctl->shared;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* See notes at top of file */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
/*
* If the slot is still in an io-in-progress state, then either someone
@@ -406,6 +413,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
for (;;)
{
int slotno;
+ int banklockno;
bool ok;
/* See if page already is in memory; if not, pick victim slot */
@@ -448,9 +456,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
/* Acquire per-buffer lock (cannot deadlock, see notes at top) */
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
+ banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
/* Do the read */
ok = SlruPhysicalReadPage(ctl, pageno, slotno);
@@ -459,7 +468,7 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SimpleLruZeroLSNs(ctl, slotno);
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS &&
@@ -503,9 +512,10 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
int slotno;
int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(bankstart);
/* Try to find the page while holding only shared lock */
- LWLockAcquire(shared->ControlLock, LW_SHARED);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_SHARED);
/*
* See if the page is already in a buffer pool. The buffer pool is
@@ -529,8 +539,8 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
}
/* No luck, so switch to normal exclusive lock and do regular read */
- LWLockRelease(shared->ControlLock);
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
return SimpleLruReadPage(ctl, pageno, true, xid);
}
@@ -552,6 +562,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
bool ok;
+ int banklockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
/* If a write is in progress, wait for it to finish */
while (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
@@ -580,7 +591,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
/* Release control lock while doing I/O */
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
@@ -595,7 +606,7 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
}
/* Re-acquire control lock and update page state */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, LW_EXCLUSIVE);
Assert(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS);
@@ -1039,7 +1050,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
int bestinvalidslot = 0; /* keep compiler quiet */
int best_invalid_delta = -1;
int best_invalid_page_number = 0; /* keep compiler quiet */
- int bankstart = (pageno & ctl->bank_mask) * SLRU_BANK_SIZE;
+ int bankno = pageno & ctl->bank_mask;
+ int bankstart = bankno * SLRU_BANK_SIZE;
int bankend = bankstart + SLRU_BANK_SIZE;
/*
@@ -1081,7 +1093,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
- cur_count = (shared->cur_lru_count)++;
+ cur_count = (shared->bank_cur_lru_count[bankno])++;
for (slotno = bankstart; slotno < bankend; slotno++)
{
int this_delta;
@@ -1103,7 +1115,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
- if (this_page_number == shared->latest_page_number)
+ if (this_page_number == pg_atomic_read_u32(&shared->latest_page_number))
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
@@ -1177,6 +1189,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
int slotno;
int pageno = 0;
int i;
+ int prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
bool ok;
/* update the stats counter of flushes */
@@ -1187,10 +1200,23 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
*/
fdata.num_files = 0;
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If the current bank lock is not the same as the previous bank lock,
+ * release the previous lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+
SlruInternalWritePage(ctl, slotno, &fdata);
/*
@@ -1204,7 +1230,7 @@ SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
!shared->page_dirty[slotno]));
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
/*
* Now close any files that were open
@@ -1244,6 +1270,7 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
{
SlruShared shared = ctl->shared;
int slotno;
+ int prevlockno;
/* update the stats counter of truncates */
pgstat_count_slru_truncate(shared->slru_stats_idx);
@@ -1254,25 +1281,38 @@ SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
* or just after a checkpoint, any dirty pages should have been flushed
* already ... we're just being extra careful here.)
*/
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
-
restart:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
- if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
+ if (ctl->PagePrecedes(pg_atomic_read_u32(&shared->latest_page_number),
+ cutoffPage))
{
- LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
+ prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If the current bank lock is not the same as the previous bank lock,
+ * release the previous lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
if (!ctl->PagePrecedes(shared->page_number[slotno], cutoffPage))
@@ -1302,10 +1342,12 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
+
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
goto restart;
}
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
/* Now we can remove the old segment(s) */
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
@@ -1346,15 +1388,29 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
bool did_write;
+ int prevlockno = SLRU_SLOTNO_GET_BANKLOCKNO(0);
/* Clean out any possibly existing references to the segment. */
- LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+ LWLockAcquire(&shared->bank_locks[prevlockno].lock, LW_EXCLUSIVE);
restart:
did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
- int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+ int pagesegno;
+ int curlockno = SLRU_SLOTNO_GET_BANKLOCKNO(slotno);
+
+ /*
+ * If the current bank lock is not the same as the previous bank lock,
+ * release the previous lock and acquire the new one.
+ */
+ if (curlockno != prevlockno)
+ {
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
+ LWLockAcquire(&shared->bank_locks[curlockno].lock, LW_EXCLUSIVE);
+ prevlockno = curlockno;
+ }
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;
@@ -1388,7 +1444,7 @@ restart:
SlruInternalDeleteSegment(ctl, segno);
- LWLockRelease(shared->ControlLock);
+ LWLockRelease(&shared->bank_locks[prevlockno].lock);
}
/*
@@ -1630,6 +1686,38 @@ SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
return result;
}
+/*
+ * Function to mark a buffer slot "most recently used".
+ *
+ * The reason for the if-test is that there are often many consecutive
+ * accesses to the same page (particularly the latest page). By suppressing
+ * useless increments of bank_cur_lru_count, we reduce the probability that old
+ * pages' counts will "wrap around" and make them appear recently used.
+ *
+ * We allow this code to be executed concurrently by multiple processes within
+ * SimpleLruReadPage_ReadOnly(). As long as int reads and writes are atomic,
+ * this should not cause any completely-bogus values to enter the computation.
+ * However, it is possible for either bank_cur_lru_count or individual
+ * page_lru_count entries to be "reset" to lower values than they should have,
+ * in case a process is delayed while it executes this function. With care in
+ * SlruSelectLRUPage(), this does little harm, and in any case the absolute
+ * worst possible consequence is a nonoptimal choice of page to evict. The
+ * gain from allowing concurrent reads of SLRU pages seems worth it.
+ */
+static inline void
+SlruRecentlyUsed(SlruShared shared, int slotno)
+{
+ int bankno = slotno / SLRU_BANK_SIZE;
+ int new_lru_count = shared->bank_cur_lru_count[bankno];
+
+ if (new_lru_count != shared->page_lru_count[slotno])
+ {
+ shared->bank_cur_lru_count[bankno] = ++new_lru_count;
+ shared->page_lru_count[slotno] = new_lru_count;
+ }
+}
+
/*
* Helper function for GUC check_hook to check whether slru buffers are in
* multiples of SLRU_BANK_SIZE.
@@ -1646,3 +1734,37 @@ check_slru_buffers(const char *name, int *newval)
SLRU_BANK_SIZE);
return false;
}
+
+/*
+ * Function to acquire all bank locks of the given SlruCtl.
+ */
+void
+SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode)
+{
+ SlruShared shared = ctl->shared;
+ int banklockno;
+ int nbanklocks;
+
+ /* Compute number of bank locks. */
+ nbanklocks = Min(shared->num_slots / SLRU_BANK_SIZE, SLRU_MAX_BANKLOCKS);
+
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockAcquire(&shared->bank_locks[banklockno].lock, mode);
+}
+
+/*
+ * Function to release all bank locks of the given SlruCtl.
+ */
+void
+SimpleLruReleaseAllBankLock(SlruCtl ctl)
+{
+ SlruShared shared = ctl->shared;
+ int banklockno;
+ int nbanklocks;
+
+ /* Compute number of bank locks. */
+ nbanklocks = Min(shared->num_slots / SLRU_BANK_SIZE, SLRU_MAX_BANKLOCKS);
+
+ for (banklockno = 0; banklockno < nbanklocks; banklockno++)
+ LWLockRelease(&shared->bank_locks[banklockno].lock);
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 923e706535..ff47985f08 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -78,12 +78,14 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
int pageno = TransactionIdToPage(xid);
int entryno = TransactionIdToEntry(xid);
int slotno;
+ LWLock *lock;
TransactionId *ptr;
Assert(TransactionIdIsValid(parent));
Assert(TransactionIdFollows(xid, parent));
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
@@ -101,7 +103,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
SubTransCtl->shared->page_dirty[slotno] = true;
}
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -131,7 +133,7 @@ SubTransGetParent(TransactionId xid)
parent = *ptr;
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SubTransCtl, pageno));
return parent;
}
@@ -194,8 +196,9 @@ SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", subtrans_buffers, 0,
- SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
+ "pg_subtrans", LWTRANCHE_SUBTRANS_BUFFER,
+ LWTRANCHE_SUBTRANS_SLRU,
+ SYNC_HANDLER_NONE);
SlruPagePrecedesUnitTests(SubTransCtl, SUBTRANS_XACTS_PER_PAGE);
}
@@ -213,8 +216,9 @@ void
BootStrapSUBTRANS(void)
{
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(SubTransCtl, 0);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Create and zero the first page of the subtrans log */
slotno = ZeroSUBTRANSPage(0);
@@ -223,7 +227,7 @@ BootStrapSUBTRANS(void)
SimpleLruWritePage(SubTransCtl, slotno);
Assert(!SubTransCtl->shared->page_dirty[slotno]);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -253,6 +257,8 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
FullTransactionId nextXid;
int startPage;
int endPage;
+ LWLock *prevlock;
+ LWLock *lock;
/*
* Since we don't expect pg_subtrans to be valid across crashes, we
@@ -260,23 +266,47 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
* Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
* the new page without regard to whatever was previously on disk.
*/
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
-
startPage = TransactionIdToPage(oldestActiveXID);
nextXid = ShmemVariableCache->nextXid;
endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
+ prevlock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
while (startPage != endPage)
{
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+ * Check if we need to acquire the lock on the new bank then release
+ * the lock on the old bank and acquire on the new bank.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
(void) ZeroSUBTRANSPage(startPage);
startPage++;
/* must account for wraparound */
if (startPage > TransactionIdToPage(MaxTransactionId))
startPage = 0;
}
- (void) ZeroSUBTRANSPage(startPage);
- LWLockRelease(SubtransSLRULock);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, startPage);
+
+ /*
+ * Check if we need to acquire the lock on the new bank then release the
+ * lock on the old bank and acquire on the new bank.
+ */
+ if (prevlock != lock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ }
+ (void) ZeroSUBTRANSPage(startPage);
+ LWLockRelease(lock);
}
/*
@@ -310,6 +340,7 @@ void
ExtendSUBTRANS(TransactionId newestXact)
{
int pageno;
+ LWLock *lock;
/*
* No work except at first XID of a page. But beware: just after
@@ -321,12 +352,13 @@ ExtendSUBTRANS(TransactionId newestXact)
pageno = TransactionIdToPage(newestXact);
- LWLockAcquire(SubtransSLRULock, LW_EXCLUSIVE);
+ lock = SimpleLruGetSLRUBankLock(SubTransCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/* Zero the page */
ZeroSUBTRANSPage(pageno);
- LWLockRelease(SubtransSLRULock);
+ LWLockRelease(lock);
}
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 98449cbdde..67da0b48bd 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -268,9 +268,10 @@ typedef struct QueueBackendStatus
* both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
* can change the tail pointers.
*
- * NotifySLRULock is used as the control lock for the pg_notify SLRU buffers.
+ * SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
+ * the control lock for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly NotifySLRULock.
+ * NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
* Each backend uses the backend[] array entry with index equal to its
* BackendId (which can range from 1 to MaxBackends). We rely on this to make
@@ -571,7 +572,7 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", notify_buffers, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ "pg_notify", LWTRANCHE_NOTIFY_BUFFER, LWTRANCHE_NOTIFY_SLRU,
SYNC_HANDLER_NONE);
if (!found)
@@ -1403,7 +1404,7 @@ asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe)
* Eventually we will return NULL indicating all is done.
*
* We are holding NotifyQueueLock already from the caller and grab
- * NotifySLRULock locally in this function.
+ * page specific SLRU bank lock locally in this function.
*/
static ListCell *
asyncQueueAddEntries(ListCell *nextNotify)
@@ -1413,9 +1414,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
int pageno;
int offset;
int slotno;
-
- /* We hold both NotifyQueueLock and NotifySLRULock during this operation */
- LWLockAcquire(NotifySLRULock, LW_EXCLUSIVE);
+ LWLock *prevlock;
/*
* We work with a local copy of QUEUE_HEAD, which we write back to shared
@@ -1439,6 +1438,11 @@ asyncQueueAddEntries(ListCell *nextNotify)
* wrapped around, but re-zeroing the page is harmless in that case.)
*/
pageno = QUEUE_POS_PAGE(queue_head);
+ prevlock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+
+ /* We hold both NotifyQueueLock and SLRU bank lock during this operation */
+ LWLockAcquire(prevlock, LW_EXCLUSIVE);
+
if (QUEUE_POS_IS_ZERO(queue_head))
slotno = SimpleLruZeroPage(NotifyCtl, pageno);
else
@@ -1484,6 +1488,17 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Advance queue_head appropriately, and detect if page is full */
if (asyncQueueAdvance(&(queue_head), qe.length))
{
+ LWLock *lock;
+
+ pageno = QUEUE_POS_PAGE(queue_head);
+ lock = SimpleLruGetSLRUBankLock(NotifyCtl, pageno);
+ if (lock != prevlock)
+ {
+ LWLockRelease(prevlock);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+ prevlock = lock;
+ }
+
/*
* Page is full, so we're done here, but first fill the next page
* with zeroes. The reason to do this is to ensure that slru.c's
@@ -1510,7 +1525,7 @@ asyncQueueAddEntries(ListCell *nextNotify)
/* Success, so update the global QUEUE_HEAD */
QUEUE_HEAD = queue_head;
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(prevlock);
return nextNotify;
}
@@ -1989,9 +2004,9 @@ asyncQueueReadAllNotifications(void)
/*
* We copy the data from SLRU into a local buffer, so as to avoid
- * holding the NotifySLRULock while we are examining the entries
- * and possibly transmitting them to our frontend. Copy only the
- * part of the page we will actually inspect.
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
*/
slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
InvalidTransactionId);
@@ -2011,7 +2026,7 @@ asyncQueueReadAllNotifications(void)
NotifyCtl->shared->page_buffer[slotno] + curoffset,
copysize);
/* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(NotifySLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(NotifyCtl, curpage));
/*
* Process messages up to the stop position, end of page, or an
@@ -2052,7 +2067,7 @@ asyncQueueReadAllNotifications(void)
*
* The current page must have been fetched into page_buffer from shared
* memory. (We could access the page right in shared memory, but that
- * would imply holding the NotifySLRULock throughout this routine.)
+ * would imply holding the SLRU bank lock throughout this routine.)
*
* We stop if we reach the "stop" position, or reach a notification from an
* uncommitted transaction, or reach the end of the page.
@@ -2205,7 +2220,7 @@ asyncQueueAdvanceTail(void)
if (asyncQueuePagePrecedes(oldtailpage, boundary))
{
/*
- * SimpleLruTruncate() will ask for NotifySLRULock but will also
+ * SimpleLruTruncate() will ask for SLRU bank locks but will also
* release the lock again.
*/
SimpleLruTruncate(NotifyCtl, newtailpage);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..1261af0548 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,6 +190,20 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_XACT_SLRU: */
+ "XactSLRU",
+ /* LWTRANCHE_COMMITTS_SLRU: */
+ "CommitTSSLRU",
+ /* LWTRANCHE_SUBTRANS_SLRU: */
+ "SubtransSLRU",
+ /* LWTRANCHE_MULTIXACTOFFSET_SLRU: */
+ "MultixactOffsetSLRU",
+ /* LWTRANCHE_MULTIXACTMEMBER_SLRU: */
+ "MultixactMemberSLRU",
+ /* LWTRANCHE_NOTIFY_SLRU: */
+ "NotifySLRU",
+ /* LWTRANCHE_SERIAL_SLRU: */
+ "SerialSLRU"
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..9e66ecd1ed 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -16,11 +16,11 @@ WALBufMappingLock 7
WALWriteLock 8
ControlFileLock 9
# 10 was CheckpointLock
-XactSLRULock 11
-SubtransSLRULock 12
+# 11 was XactSLRULock
+# 12 was SubtransSLRULock
MultiXactGenLock 13
-MultiXactOffsetSLRULock 14
-MultiXactMemberSLRULock 15
+# 14 was MultiXactOffsetSLRULock
+# 15 was MultiXactMemberSLRULock
RelCacheInitLock 16
CheckpointerCommLock 17
TwoPhaseStateLock 18
@@ -31,19 +31,19 @@ AutovacuumLock 22
AutovacuumScheduleLock 23
SyncScanLock 24
RelationMappingLock 25
-NotifySLRULock 26
+#26 was NotifySLRULock
NotifyQueueLock 27
SerializableXactHashLock 28
SerializableFinishedListLock 29
SerializablePredicateListLock 30
-SerialSLRULock 31
+SerialControlLock 31
SyncRepLock 32
BackgroundWorkerLock 33
DynamicSharedMemoryControlLock 34
AutoFileLock 35
ReplicationSlotAllocationLock 36
ReplicationSlotControlLock 37
-CommitTsSLRULock 38
+#38 was CommitTsSLRULock
CommitTsLock 39
ReplicationOriginLock 40
MultiXactTruncationLock 41
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e4903c67ec..7632c42978 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -809,8 +809,9 @@ SerialInit(void)
*/
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
- serial_buffers, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
+ serial_buffers, 0, "pg_serial",
+ LWTRANCHE_SERIAL_BUFFER, LWTRANCHE_SERIAL_SLRU,
+ SYNC_HANDLER_NONE);
#ifdef USE_ASSERT_CHECKING
SerialPagePrecedesLogicallyUnitTests();
#endif
@@ -847,12 +848,14 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
int slotno;
int firstZeroPage;
bool isNewPage;
+ LWLock *lock;
Assert(TransactionIdIsValid(xid));
targetPage = SerialPage(xid);
+ lock = SimpleLruGetSLRUBankLock(SerialSlruCtl, targetPage);
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
* If no serializable transactions are active, there shouldn't be anything
@@ -902,7 +905,7 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
SerialValue(slotno, xid) = minConflictCommitSeqNo;
SerialSlruCtl->shared->page_dirty[slotno] = true;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(lock);
}
/*
@@ -920,10 +923,10 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
Assert(TransactionIdIsValid(xid));
- LWLockAcquire(SerialSLRULock, LW_SHARED);
+ LWLockAcquire(SerialControlLock, LW_SHARED);
headXid = serialControl->headXid;
tailXid = serialControl->tailXid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
if (!TransactionIdIsValid(headXid))
return 0;
@@ -935,13 +938,13 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
return 0;
/*
- * The following function must be called without holding SerialSLRULock,
+ * The following function must be called without holding SLRU bank lock,
* but will return with that lock held, which must then be released.
*/
slotno = SimpleLruReadPage_ReadOnly(SerialSlruCtl,
SerialPage(xid), xid);
val = SerialValue(slotno, xid);
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SimpleLruGetSLRUBankLock(SerialSlruCtl, SerialPage(xid)));
return val;
}
@@ -954,7 +957,7 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
static void
SerialSetActiveSerXmin(TransactionId xid)
{
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/*
* When no sxacts are active, nothing overlaps, set the xid values to
@@ -966,7 +969,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = InvalidTransactionId;
serialControl->headXid = InvalidTransactionId;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -984,7 +987,7 @@ SerialSetActiveSerXmin(TransactionId xid)
{
serialControl->tailXid = xid;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -993,7 +996,7 @@ SerialSetActiveSerXmin(TransactionId xid)
serialControl->tailXid = xid;
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
}
/*
@@ -1007,12 +1010,12 @@ CheckPointPredicate(void)
{
int truncateCutoffPage;
- LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(SerialControlLock, LW_EXCLUSIVE);
/* Exit quickly if the SLRU is currently not in use. */
if (serialControl->headPage < 0)
{
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
return;
}
@@ -1072,7 +1075,7 @@ CheckPointPredicate(void)
serialControl->headPage = -1;
}
- LWLockRelease(SerialSLRULock);
+ LWLockRelease(SerialControlLock);
/* Truncate away pages that are no longer required */
SimpleLruTruncate(SerialSlruCtl, truncateCutoffPage);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 51c5762b9f..d9be57de75 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -21,6 +21,7 @@
* SLRU bank size for slotno hash banks
*/
#define SLRU_BANK_SIZE 16
+#define SLRU_MAX_BANKLOCKS 128
/*
* To avoid overflowing internal arithmetic and the size_t data type, the
@@ -62,8 +63,6 @@ typedef enum
*/
typedef struct SlruSharedData
{
- LWLock *ControlLock;
-
/* Number of buffers managed by this SLRU structure */
int num_slots;
@@ -76,36 +75,52 @@ typedef struct SlruSharedData
bool *page_dirty;
int *page_number;
int *page_lru_count;
+
+ /* The buffer_locks protects the I/O on each buffer slots */
LWLockPadded *buffer_locks;
/*
- * Optional array of WAL flush LSNs associated with entries in the SLRU
- * pages. If not zero/NULL, we must flush WAL before writing pages (true
- * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
- * has lsn_groups_per_page entries per buffer slot, each containing the
- * highest LSN known for a contiguous group of SLRU entries on that slot's
- * page.
+ * Locks to protect the in memory buffer slot access in SLRU bank. If the
+ * number of banks are <= SLRU_MAX_BANKLOCKS then there will be one lock
+ * per bank otherwise each lock will protect multiple banks depends upon
+ * the number of banks.
*/
- XLogRecPtr *group_lsn;
- int lsn_groups_per_page;
+ LWLockPadded *bank_locks;
/*----------
+ * Instead of global counter we maintain a bank-wise lru counter because
+ * a) we are doing the victim buffer selection as bank level so there is
+ * no point of having a global counter b) manipulating a global counter
+ * will have frequent cpu cache invalidation and that will affect the
+ * performance.
+ *
* We mark a page "most recently used" by setting
- * page_lru_count[slotno] = ++cur_lru_count;
+ * page_lru_count[slotno] = ++bank_cur_lru_count[bankno];
* The oldest page is therefore the one with the highest value of
- * cur_lru_count - page_lru_count[slotno]
+ * bank_cur_lru_count[bankno] - page_lru_count[slotno]
* The counts will eventually wrap around, but this calculation still
* works as long as no page's age exceeds INT_MAX counts.
*----------
*/
- int cur_lru_count;
+ int *bank_cur_lru_count;
+
+ /*
+ * Optional array of WAL flush LSNs associated with entries in the SLRU
+ * pages. If not zero/NULL, we must flush WAL before writing pages (true
+ * for pg_xact, false for multixact, pg_subtrans, pg_notify). group_lsn[]
+ * has lsn_groups_per_page entries per buffer slot, each containing the
+ * highest LSN known for a contiguous group of SLRU entries on that slot's
+ * page.
+ */
+ XLogRecPtr *group_lsn;
+ int lsn_groups_per_page;
/*
* latest_page_number is the page number of the current end of the log;
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
*/
- int latest_page_number;
+ pg_atomic_uint32 latest_page_number;
/* SLRU's index for statistics purposes (might not be unique) */
int slru_stats_idx;
@@ -153,11 +168,24 @@ typedef struct SlruCtlData
typedef SlruCtlData *SlruCtl;
+/*
+ * Get the SLRU bank lock for given SlruCtl and the pageno.
+ *
+ * This lock needs to be acquire in order to access the slru buffer slots in
+ * the respective bank. For more details refer comments in SlruSharedData.
+ */
+static inline LWLock *
+SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno)
+{
+ int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;
+
+ return &(ctl->shared->bank_locks[banklockno].lock);
+}
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id,
- SyncRequestHandler sync_handler);
+ const char *subdir, int buffer_tranche_id,
+ int bank_tranche_id, SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -185,5 +213,8 @@ extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage,
void *data);
+extern LWLock *SimpleLruGetSLRUBankLock(SlruCtl ctl, int pageno);
extern bool check_slru_buffers(const char *name, int *newval);
+extern void SimpleLruAcquireAllBankLock(SlruCtl ctl, LWLockMode mode);
+extern void SimpleLruReleaseAllBankLock(SlruCtl ctl);
#endif /* SLRU_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b038e599c0..87cb812b84 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,13 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_XACT_SLRU,
+ LWTRANCHE_COMMITTS_SLRU,
+ LWTRANCHE_SUBTRANS_SLRU,
+ LWTRANCHE_MULTIXACTOFFSET_SLRU,
+ LWTRANCHE_MULTIXACTMEMBER_SLRU,
+ LWTRANCHE_NOTIFY_SLRU,
+ LWTRANCHE_SERIAL_SLRU,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
index ae21444c47..9a02f33933 100644
--- a/src/test/modules/test_slru/test_slru.c
+++ b/src/test/modules/test_slru/test_slru.c
@@ -40,10 +40,6 @@ PG_FUNCTION_INFO_V1(test_slru_delete_all);
/* Number of SLRU page slots */
#define NUM_TEST_BUFFERS 16
-/* SLRU control lock */
-LWLock TestSLRULock;
-#define TestSLRULock (&TestSLRULock)
-
static SlruCtlData TestSlruCtlData;
#define TestSlruCtl (&TestSlruCtlData)
@@ -63,9 +59,9 @@ test_slru_page_write(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = text_to_cstring(PG_GETARG_TEXT_PP(1));
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
-
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruZeroPage(TestSlruCtl, pageno);
/* these should match */
@@ -80,7 +76,7 @@ test_slru_page_write(PG_FUNCTION_ARGS)
BLCKSZ - 1);
SimpleLruWritePage(TestSlruCtl, slotno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_VOID();
}
@@ -99,13 +95,14 @@ test_slru_page_read(PG_FUNCTION_ARGS)
bool write_ok = PG_GETARG_BOOL(1);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
slotno = SimpleLruReadPage(TestSlruCtl, pageno,
write_ok, InvalidTransactionId);
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -116,14 +113,15 @@ test_slru_page_readonly(PG_FUNCTION_ARGS)
int pageno = PG_GETARG_INT32(0);
char *data = NULL;
int slotno;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
/* find page in buffers, reading it if necessary */
slotno = SimpleLruReadPage_ReadOnly(TestSlruCtl,
pageno,
InvalidTransactionId);
- Assert(LWLockHeldByMe(TestSLRULock));
+ Assert(LWLockHeldByMe(lock));
data = (char *) TestSlruCtl->shared->page_buffer[slotno];
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_TEXT_P(cstring_to_text(data));
}
@@ -133,10 +131,11 @@ test_slru_page_exists(PG_FUNCTION_ARGS)
{
int pageno = PG_GETARG_INT32(0);
bool found;
+ LWLock *lock = SimpleLruGetSLRUBankLock(TestSlruCtl, pageno);
- LWLockAcquire(TestSLRULock, LW_EXCLUSIVE);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
found = SimpleLruDoesPhysicalPageExist(TestSlruCtl, pageno);
- LWLockRelease(TestSLRULock);
+ LWLockRelease(lock);
PG_RETURN_BOOL(found);
}
@@ -215,6 +214,7 @@ test_slru_shmem_startup(void)
{
const char slru_dir_name[] = "pg_test_slru";
int test_tranche_id;
+ int test_buffer_tranche_id;
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
@@ -228,11 +228,13 @@ test_slru_shmem_startup(void)
/* initialize the SLRU facility */
test_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_tranche_id, "test_slru_tranche");
- LWLockInitialize(TestSLRULock, test_tranche_id);
+
+ test_buffer_tranche_id = LWLockNewTrancheId();
LWLockRegisterTranche(test_buffer_tranche_id, "test_buffer_tranche");
TestSlruCtl->PagePrecedes = test_slru_page_precedes_logically;
SimpleLruInit(TestSlruCtl, "TestSLRU",
- NUM_TEST_BUFFERS, 0, TestSLRULock, slru_dir_name,
+ NUM_TEST_BUFFERS, 0, slru_dir_name, test_buffer_tranche_id,
test_tranche_id, SYNC_HANDLER_NONE);
}
--
2.39.2 (Apple Git-143)
In SlruSharedData, a new comment is added that starts:
"Instead of global counter we maintain a bank-wise lru counter because ..."
You don't need to explain what we don't do. Just explain what we do do.
So remove the words "Instead of a global counter" from there, because
they offer no wisdom. Same with the phrase "so there is no point to ...".
I think "The oldest page is therefore" should say "The oldest page *in
the bank* is therefore", for extra clarity.
I wonder what's the deal with false sharing in the new
bank_cur_lru_count array. Maybe instead of using LWLockPadded for
bank_locks, we should have a new struct, with both the LWLock and the
LRU counter; then pad *that* to the cacheline size. This way, both the
lwlock and the counter come to the CPU running this code together.
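The padded-struct idea could look something like this. This is only a standalone sketch: "ToyLock" stands in for the real LWLock, the type names are invented, and 64 bytes is assumed for the cacheline size (PostgreSQL actually uses PG_CACHE_LINE_SIZE):

```c
#define CACHELINE_SIZE 64       /* assumed; PG would use PG_CACHE_LINE_SIZE */

/* toy stand-in for LWLock */
typedef struct
{
    int         state;
} ToyLock;

/* the bank's lock and its LRU counter, kept adjacent in memory */
typedef struct
{
    ToyLock     lock;
    int         cur_lru_count;
} SlruBank;

/* padded to a full cacheline, in the style of LWLockPadded */
typedef union
{
    SlruBank    bank;
    char        pad[CACHELINE_SIZE];
} SlruBankPadded;
```

With an array of SlruBankPadded, each bank's lock and counter land on the same cacheline, so fetching one brings the other along and neighbouring banks never share a line.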
Looking at SlruRecentlyUsed, which was a macro and is now a function.
The comment about "multiple evaluation of arguments" no longer applies,
so it needs to be removed; and it shouldn't talk about itself being a
macro.
Using "Size" as the type for bank_mask looks odd. For a bitmask, maybe
it'd be more appropriate to use bits64 if we do need a 64-bit mask (we
don't have bits64, but it's easy to add a typedef). I bet we don't
really need a 64-bit mask, and a 32-bit or even a 16-bit one is
sufficient, given the other limitations on the number of buffers.
I think SimpleLruReadPage should have this assert at start:
+ Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno)));
Do we really need one separate lwlock tranche for each SLRU?
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
On 2023-Nov-17, Dilip Kumar wrote:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS. How do you know that these changes are safe?

IMHO this is safe as well as logical to do w.r.t. performance. It's
safe because whenever we are updating any page in a group we are
acquiring the respective bank lock in exclusive mode, and in extreme
cases if there are pages from different banks then we do switch the
lock as well before updating the pages from a different bank.
Looking at the coverage for this code,
https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
it seems in our test suites we never hit the case where there is
anything in the "nextidx" field for commit groups. To be honest, I
don't understand this group stuff, and so I'm doubly hesitant to go
without any testing here. Maybe it'd be possible to use Michael
Paquier's injection points somehow?
I think in the code comments where you use "w.r.t.", that acronym can be
replaced with "for", which improves readability.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
(George Orwell's The Lord of the Rings)
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Thanks for the review. All comments look fine to me; replying to
those that need some clarification.
I wonder what's the deal with false sharing in the new
bank_cur_lru_count array. Maybe instead of using LWLockPadded for
bank_locks, we should have a new struct, with both the LWLock and the
LRU counter; then pad *that* to the cacheline size. This way, both the
lwlock and the counter come to the CPU running this code together.
Actually, the array lengths of both the LWLocks and the LRU counters are
different, so I don't think we can move them into a common structure.
The length of the *buffer_locks array is equal to the number of slots,
the length of the *bank_locks array is Min (number_of_banks, 128), and
the length of the *bank_cur_lru_count array is number_of_banks.
Looking at the coverage for this code,
https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
it seems in our test suites we never hit the case where there is
anything in the "nextidx" field for commit groups. To be honest, I
don't understand this group stuff, and so I'm doubly hesitant to go
without any testing here. Maybe it'd be possible to use Michael
Paquier's injection points somehow?
Sorry, but I am not aware of "Michael Paquier's injection points". Is it
something already in the repo? Can you redirect me to some of the
example test cases if we already have them? Then I will try it out.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2023-Nov-18, Dilip Kumar wrote:
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I wonder what's the deal with false sharing in the new
bank_cur_lru_count array. Maybe instead of using LWLockPadded for
bank_locks, we should have a new struct, with both the LWLock and the
LRU counter; then pad *that* to the cacheline size. This way, both the
lwlock and the counter come to the CPU running this code together.

Actually, the array lengths of both the LWLocks and the LRU counters are
different, so I don't think we can move them into a common structure.
The length of the *buffer_locks array is equal to the number of slots,
the length of the *bank_locks array is Min (number_of_banks, 128), and
the length of the *bank_cur_lru_count array is number_of_banks.
Oh.
Looking at the coverage for this code,
https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
it seems in our test suites we never hit the case where there is
anything in the "nextidx" field for commit groups. To be honest, I
don't understand this group stuff, and so I'm doubly hesitant to go
without any testing here. Maybe it'd be possible to use Michael
Paquier's injection points somehow?

Sorry, but I am not aware of "Michael Paquier's injection points". Is it
something already in the repo? Can you redirect me to some of the
example test cases if we already have them? Then I will try it out.
https://postgr.es/ZVWufO_YKzTJHEHW@paquier.xyz
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)
On 17 Nov 2023, at 16:11, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
PFA the updated patch version; this fixes the comments given by Alvaro and
also improves some of the comments.
I’ve skimmed through the patch set. Here are some minor notes.
1. The loops “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK.
But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.
2. Do we really want these functions doing all the same?
extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);
3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.
I do not have hard opinion on any of this items.
Best regards, Andrey Borodin.
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I’ve skimmed through the patch set. Here are some minor notes.
Thanks for the review
1. The loops “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK.
But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.
Do you mean to say we need to modify the comments, or are you saying
pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage()? If it
is the latter, I can see that the caller of SlruSelectLRUPage() is
calling pgstat_count_slru_page_hit() and SlruRecentlyUsed().
2. Do we really want these functions doing all the same?
extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);
I tried deduplicating these by doing all the work inside the
check_slru_buffers() function, but I think it is hard to make them a
single function because there is no way to pass the SLRU name to the
GUC check hook, and IMHO in the check hook we need to print the GUC
name. Any suggestions on how we can avoid having so many functions?
3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.
I do not have hard opinion on any of this items.
I prefer SimpleLruGetBankLock() so that it is consistent with other
external functions starting with "SimpleLruGet". Are you fine with
this name?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2023-Nov-17, Dilip Kumar wrote:
I think I need some more clarification on some of the review comments.
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I just noticed that 0003 does some changes to
TransactionGroupUpdateXidStatus() that haven't been adequately
explained AFAICS. How do you know that these changes are safe?

IMHO this is safe as well as logical to do w.r.t. performance. It's
safe because whenever we are updating any page in a group we are
acquiring the respective bank lock in exclusive mode and in extreme
cases if there are pages from different banks then we do switch the
lock as well before updating the pages from a different bank.

Looking at the coverage
https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
it seems in our test suites we never hit the case where there is
anything in the "nextidx" field for commit groups.
1)
I was looking into your coverage report (screenshot attached), and it
seems we do hit the block where nextidx is not INVALID_PGPROCNO, which
means there is some process other than the group leader. I have
already started exploring the injection points, but I just wanted to
be sure what your main concern about the coverage is, so I thought of
checking that first.
 470             :     /*
 471             :      * If the list was not empty, the leader will update the status of our
 472             :      * XID. It is impossible to have followers without a leader because the
 473             :      * first process that has added itself to the list will always have
 474             :      * nextidx as INVALID_PGPROCNO.
 475             :      */
 476          98 :     if (nextidx != INVALID_PGPROCNO)
 477             :     {
 478           2 :         int         extraWaits = 0;
 479             :
 480             :         /* Sleep until the leader updates our XID status. */
 481           2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
 482             :         for (;;)
 483             :         {
 484             :             /* acts as a read barrier */
 485           2 :             PGSemaphoreLock(proc->sem);
 486           2 :             if (!proc->clogGroupMember)
 487           2 :                 break;
 488           0 :             extraWaits++;
 489             :         }
2) Do we really need one separate lwlock tranche for each SLRU?
IMHO if we use the same lwlock tranche then the wait event will show
the same wait event name, right? And that would be confusing for the
user: whether we are waiting on Subtransaction or Multixact or
anything else would be indistinguishable. Is my understanding not
correct here?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
image.png (image/png)
�q/��o��J_���L�!���+>���[f�R?'��\���Y��"���9����/�?-q�a��BU�C���1f�.�r!���=!�_G ��~�����z�z��e�Ei���lJ�j���I�3&���;�>����2R)��%&����N>R&L0lt��g��DauR��IWJU�R���~�>��>