SLRU optimization - configurable buffer pool and partitioning the SLRU lock

Started by Dilip Kumar over 2 years ago · 131 messages
#1 Dilip Kumar
dilipbalaut@gmail.com

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There are a couple of patches proposed in the past to address the
problem of increasing the buffer pool size; one of the patches [1] was
proposed by Thomas Munro, where we make the size of the buffer pool
configurable. And, in order to deal with the linear search in the
large buffer pool, we divide the SLRU buffer pool into associative
banks so that searching in the buffer pool doesn’t get affected by the
large size of the buffer pool. This does well for the workloads which
are mainly impacted by the frequent buffer replacement but this still
doesn’t stand well with the workloads where the centralized control
lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and added 2 more
improvements on top of it: 1) in v1-0002, instead of a centralized
control lock for the SLRU I have introduced a bank-wise control lock;
2) in v1-0003, I have removed the global LRU counter and introduced a
bank-wise counter. The second change (v1-0003) is in order to avoid
the CPU/OS cache invalidation caused by frequent updates of a single
shared variable; later in my performance tests I will show how much
gain we get because of these 2 changes.

Note: This is going to be a long email, but I have summarised the main
idea above this point, and now I am going to discuss more internal
details in order to show that the design idea is valid. I am also
going to show 2 performance tests, where one is specific to contention
on the centralized lock and the other is mainly contention due to
frequent buffer replacement in the SLRU buffer pool. We are getting
~2x TPS compared to head with these patches, and in later sections I
will discuss this in more detail, i.e. the exact performance numbers
and an analysis of why we are seeing the gain.

There are some problems I faced while converting this centralized
control lock to a bank-wise lock, mainly because this lock is
(mis)used for different purposes. The main purpose of this control
lock, as I understand it, is to protect in-memory access (read/write)
of the buffers in the SLRU buffer pool.

Here is the list of some problems and their analysis:

1) In some of the SLRUs, we use this lock for protecting members
inside the control structure which are specific to that SLRU layer,
i.e. the SerialControlData members are protected by SerialSLRULock. I
don't think that is the right use of this lock, so for this purpose I
have introduced another lock called SerialControlLock. Based on my
analysis there is no reason for protecting these members and the SLRU
buffer access with the same lock.
2) The member called 'latest_page_number' inside SlruSharedData is
also protected by the SLRULock. I would not say this use case is
wrong, but since this is a common variable and not a per-bank variable
it cannot be protected by a bank-wise lock anymore. The only usage of
this variable is to track the latest page in an SLRU so that we do not
evict the latest page during victim page selection. So I have
converted this to an atomic variable, as it is completely independent
of the SLRU buffer access.
3) SlruScanDirectory() is called under the SLRU control lock from
DeactivateCommitTs(), but from all other places SlruScanDirectory() is
called without the lock, because those callers run from places that
are not executed concurrently (i.e. startup, checkpoint).
DeactivateCommitTs() is also called only from startup, so there
doesn't seem to be any need to call this under the SLRU control lock.
Currently, I have called this under the all-bank lock because this is
not a performance path, and that way we keep it consistent with the
current logic; but if others also think that we do not need a lock at
this place, then we could remove it, and then we wouldn't need the
all-bank lock anywhere.

There are some other uses of this lock where one might think dividing
it into a bank-wise lock would be a problem, but it is not, and I have
given my analysis for each below.

1) SimpleLruTruncate: We might worry that converting to a bank-wise
lock could be an issue here, as we might need to release and acquire
different locks as we scan different banks. But as per my analysis
this is not an issue, because a) with the current code we also release
and acquire the centralized lock multiple times in order to perform
the I/O on the buffer slot, so the behavior is not changed; and, most
importantly, b) all SLRU layers take care that this function is not
accessed concurrently. I have verified all callers of this function
and this is true, and the function header says the same. So this is
not an issue as per my analysis.

2) Extending or adding a new page to an SLRU: I have noticed that this
is also protected either by some other exclusive lock or only done
during startup. So in short, the SLRULock is just used for protecting
access to the buffers in the buffer pool; it is not what guarantees
exclusive execution of these functions, because that is taken care of
in some other way.

3) Another thing I noticed while writing this, which is worth noting
here: the CLOG group update of xid status. There, if we do not get the
control lock on the SLRU, we add ourselves to a group and the group
leader does the job for all members of the group. One might think that
different pages in the group could belong to different SLRU banks, so
the leader might need to acquire/release the lock as it processes the
requests in the group. Yes, that is true, and it is taken care of; but
we don't need to worry much about this case, because as per the
implementation of the group update we try to put members with the same
page request into one group, and only as an exception could there be
members with a different page request. So the point is that with a
bank-wise lock we handle that exception case, but it is not the
regular case that we need to acquire/release multiple times. So
design-wise we are good, and performance-wise there should not be any
problem, because most of the time we will be updating pages from the
same bank; and if in some cases we have updates for old transactions
due to long-running transactions, then we should do better by not
having a centralized lock.

Performance Test:
Exp1: Shows the problem of CPU/OS cache invalidation due to frequent
updates of the centralized lock and a common LRU counter. Here, in
parallel with the pgbench script, we run a transaction that frequently
creates subtransaction overflow, which forces the visibility-check
mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (create subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1);
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied, which converts the SLRU buffer pool
into banks (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank +
bank-wise lock and bank-wise LRU counter) compared to head. One might
wonder why we do not see much gain with only the SlruBank patch. The
reason is that in this particular test case we are not seeing much
load on buffer replacement. In fact, the wait events also don't show
contention on any lock; instead the main load is due to frequently
modifying common variables like the centralized control lock and the
centralized LRU counter. That is evident in the perf data shown below.

+  74.72%   0.06% postgres postgres  [.] XidInMVCCSnapshot
+  74.08%   0.02% postgres postgres  [.] SubTransGetTopmostTransaction
+  74.04%   0.07% postgres postgres  [.] SubTransGetParent
+  57.66%   0.04% postgres postgres  [.] LWLockAcquire
+  57.64%   0.26% postgres postgres  [.] SimpleLruReadPage_ReadOnly
……
+  16.53%   0.07% postgres postgres  [.] LWLockRelease
+  16.36%   0.04% postgres postgres  [.] pg_atomic_sub_fetch_u32
+  16.31%  16.24% postgres postgres  [.] pg_atomic_fetch_sub_u32_impl

We can notice that the main load is on the atomic variable within
LWLockAcquire and LWLockRelease. Once we apply the bank-wise lock
patch (v1-0002), the same problem becomes visible on the cur_lru_count
update in the SlruRecentlyUsed macro [2] (I have not shown that here
but it was visible in my perf report). And that is resolved by
implementing a bank-wise counter.

[2]:
#define SlruRecentlyUsed(shared, slotno) \
do { \
..
(shared)->cur_lru_count = ++new_lru_count; \
..
} \
} while (0)

Exp2: This test shows the load of frequent SLRU buffer replacement. In
this test we run a pgbench-like script which frequently generates
multixact ids, and in parallel we repeatedly start and commit a
long-running transaction, so that the multixact ids are not
immediately cleaned up by vacuum and we create contention on the SLRU
buffer pool. I am not leaving the long-running transaction running
forever, as that would start to show another problem with respect to
bloat, and we would lose the purpose of what I am trying to show here.

Note: the test configuration is the same as Exp1; just the workload is
different, and we run the 2 scripts below. There is also the new
config parameter (added in v1-0001) slru_buffers_size_scale=4, which
means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (16 in head) and
NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 in head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results:
Clients Head SlruBank SlruBank+BankwiseLock
1 528 513 531
8 3870 4239 4157
32 13945 14470 14556
64 10086 19034 24482
128 6909 15627 18161

Here we can see a good improvement with the SlruBank patch itself,
because of the increased SLRU buffer pool size, as in this workload
there is a lot of contention due to buffer replacement. As shown
below, we can see a lot of load on MultiXactOffsetSLRU as well as on
MultiXactOffsetBuffer, which shows there are frequent buffer evictions
in this workload. Increasing the SLRU buffer pool size helps a lot,
and by further dividing the SLRU lock into bank-wise locks we see a
further gain. So in total, we are seeing ~2.5x TPS at 64 and 128
threads compared to head.

3401 LWLock | MultiXactOffsetSLRU
2031 LWLock | MultiXactOffsetBuffer
687 |
427 LWLock | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro and I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas
from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0003-Introduce-bank-wise-LRU-counter.patch (+32 −20)
v1-0001-Divide-SLRU-buffers-into-banks.patch (+117 −49)
v1-0002-bank-wise-slru-locks.patch (+452 −199)
#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#1)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There is a couple of patches proposed in the past to address the
problem of increasing the buffer pool size, one of the patch [1] was
proposed by Thomas Munro where we make the size of the buffer pool
configurable.

In my last email, I forgot to give the link from where I have taken
the base patch for dividing the buffer pool into banks, so giving the
same here [1]. And looking at this again, it seems that the idea of
that patch was from Andrey M. Borodin, and the idea of the SLRU scale
factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

[1]: https://commitfest.postgresql.org/43/2627/

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#2)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from
Andrey M. Borodin and the idea of the SLRU scale factor were
introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing
that in the first email.

[1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch. Now, while reading
through that patch, I realized that some refactoring was needed and
there were some unused functions, so I have removed those and also
added some comments. I also did some refactoring of my own patches, so
reposting the patch series.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0002-bank-wise-slru-locks.patch (+482 −207)
v2-0003-Introduce-bank-wise-LRU-counter.patch (+32 −20)
v2-0001-Divide-SLRU-buffers-into-banks.patch (+119 −51)
#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#1)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There is a couple of patches proposed in the past to address the
problem of increasing the buffer pool size, one of the patch [1] was
proposed by Thomas Munro where we make the size of the buffer pool
configurable. And, in order to deal with the linear search in the
large buffer pool, we divide the SLRU buffer pool into associative
banks so that searching in the buffer pool doesn’t get affected by the
large size of the buffer pool. This does well for the workloads which
are mainly impacted by the frequent buffer replacement but this still
doesn’t stand well with the workloads where the centralized control
lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and further
added 2 more improvements to this 1) In v1-0002, Instead of a
centralized control lock for the SLRU I have introduced a bank-wise
control lock 2)In v1-0003, I have removed the global LRU counter and
introduced a bank-wise counter. The second change (v1-0003) is in
order to avoid the CPU/OS cache invalidation due to frequent updates
of the single variable, later in my performance test I will show how
much gain we have gotten because of these 2 changes.

Note: This is going to be a long email but I have summarised the main
idea above this point and now I am going to discuss more internal
information in order to show that the design idea is valid and also
going to show 2 performance tests where one is specific to the
contention on the centralized lock and other is mainly contention due
to frequent buffer replacement in SLRU buffer pool. We are getting ~2x
TPS compared to the head by these patches and in later sections, I am
going discuss this in more detail i.e. exact performance numbers and
analysis of why we are seeing the gain.

...

Performance Test:
Exp1: Show problems due to CPU/OS cache invalidation due to frequent
updates of the centralized lock and a common LRU counter. So here we
are running a parallel transaction to pgbench script which frequently
creates subtransaction overflow and that forces the visibility-check
mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 MB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts parallelly:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (create subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied to convert the SLRU buffer pool into
the bank (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731

So we can see that at 128 clients, we get ~2x TPS(with SlruBank +
BankwiseLock and bankwise LRU counter) as compared to HEAD.

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage? BTW, I remember
that there was also a discussion of moving SLRU into the regular
buffer pool [1]. You have not provided any explanation as to whether
that approach will have any merit after we do this, or whether that
approach is not worth pursuing at all.

[1]: https://commitfest.postgresql.org/43/3514/

--
With Regards,
Amit Kapila.

#5 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#4)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question; can you explain it a bit? In
short, if it is regarding the performance, then we will see it for all
the SLRUs, as the control lock is not centralized anymore and is
instead a bank-wise lock.

BTW, I remember

that there was also a discussion of moving SLRU into a regular buffer
pool [1]. You have not provided any explanation as to whether that
approach will have any merits after we do this or whether that
approach is not worth pursuing at all.

[1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail, about the performance
numbers and all. But both of these cannot coexist, because this patch
is improving the SLRU buffer pool access/configurable size and also
the lock contention. If we move SLRU to the main buffer pool then we
might not have a similar problem; instead there might be other
problems, like SLRU buffers getting swapped out due to other relation
buffers. OTOH, the advantage of that approach would be that we can
just use a bigger buffer pool and SLRU can also take advantage of
that. But in my opinion, most of the time we have limited page access
in SLRU, and the SLRU buffer access pattern is also quite different
from the relation page access pattern, so keeping them under the same
buffer pool and competing against relation pages for victim buffer
selection might cause different problems. But anyway, I would discuss
those points in that thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#2)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 2023-Oct-11, Dilip Kumar wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from Andrey M. Borodin and the idea of the SLRU scale factor
were introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

You mean [1].

[1]: /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru

I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.

Like everybody else, I like having fewer GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.

I'm inclined to use Borodin's patch last posted here [2] instead of
your proposed 0001.

[2]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail;
they look like reasonable ideas. I have not tried to reproduce the
claimed performance benefits. I think measuring this patch set with
the tests posted by Shawn Debnath in [3] is important, too.

[3]: /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
-- Paul Graham, http://www.paulgraham.com/opensource.html

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
use get rid of all this complexity.

+1

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

I think it's worth a try. It does seem simpler, and it might allow us
to sidestep some concerns about scaling when the SLRU pages are in
shared_buffers [0].

[0]: /messages/by-id/ZPsaEGRvllitxB3v@tamriel.snowman.net

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#8 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Oct-11, Dilip Kumar wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from Andrey M. Borodin and the idea of the SLRU scale factor
were introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

You mean [1].
[1] /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
use get rid of all this complexity.

Overall I agree with your comments. Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. So yeah, I think it is better that
we take the older patch version, where we have a separate GUC per
SLRU, as our base patch.

I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

I did skim patches 0002 and 0003 without going into too much detail;
they look like reasonable ideas. I have not tried to reproduce the claimed
performance benefits. I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3] /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

Thanks for taking a look.

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, though the
computation depends on NBuffers. I mean, for each SLRU we are computing
separate memory that is additional to shared_buffers, no?

we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs will
share a single buffer pool, and the LRU algorithm will decide which
pages stay in the pool and which get evicted? But wouldn't that create
another issue, with different SLRUs starting to contend on the same lock
if we have a common buffer pool for all of them? Or am I missing
something? Or are you saying that although there is a common buffer
pool, each SLRU will have its own boundaries within it, protected by a
separate lock, and based on the workload those boundaries can change
dynamically? I haven't put much thought into how practical the idea is;
I'm just trying to understand what you have in mind.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#5)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question can you explain this a bit?

I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?

--
With Regards,
Amit Kapila.

#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#9)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question can you explain this a bit?

I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?

Yes, there should be, because 1) there is no longer a common lock, so
contention on the centralized control lock is reduced when we access
transaction status from pages falling in different SLRU banks; 2) the
buffer size is configurable, so if the workload accesses transaction
status across a wide range of transactions it helps with frequent
buffer eviction, though that might not be the most common case.
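The first point can be sketched as follows (illustrative constants and names, not the patch's code): pages in different banks map to different locks, so concurrent TransactionIdGetStatus() calls on unrelated pages no longer serialize on one central lock.

```c
#include <assert.h>

/*
 * Sketch of bank-wise lock selection (illustrative, not the patch's
 * code): each page maps to one of NBANKS banks, and each bank has its
 * own control lock, so lookups on pages in different banks take
 * different locks.
 */
#define NBANKS 8

static inline int
slru_bank_for_page(int pageno)
{
    return pageno % NBANKS;     /* index of the bank (and its lock) */
}
```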

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#8)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Overall I agree with your comments, actually, I haven't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. Yeah, so I think it is better
that we take the older patch version as our base patch where we have
separate GUC per SLRU.

I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and
changed the logic of SlruAdjustNSlots() in 0002, so that it now starts
with the next power of 2 of the configured slots and keeps doubling the
number of banks until we either reach the maximum number of banks,
SLRU_MAX_BANKS (128), or the bank size would drop below
SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too
many banks, but also that we don't have very large banks. There was
also a patch 0003 in this thread, but I haven't taken it, as it is
another optimization (merging some structure members); I will analyze
its performance characteristics and try to add it on top of the
complete patch series.

Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above 2 are taken from [1] with some modifications and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email
of this thread
0004 - Implement bank-wise LRU counter as described in the first email
of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e.
merging buffer locks and bank locks in the same array so that the
bank-wise LRU counter does not fetch the next cache line in a hot
function SlruRecentlyUsed()

Note: I think 0003,0004 and 0005 can be merged together but kept
separate so that we can review them independently and see how useful
each of them is.

[1]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0005-Merge-bank-locks-array-with-buffer-locks-array.patch (+72 -67)
v3-0002-Divide-SLRU-buffers-into-banks.patch (+75 -5)
v3-0004-Introduce-bank-wise-LRU-counter.patch (+64 -48)
v3-0001-Make-all-SLRU-buffer-sizes-configurable.patch (+298 -45)
v3-0003-Bank-wise-slru-locks.patch (+494 -211)
#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#11)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on some offlist discussions with Alvaro and Robert in separate
conversations, Alvaro and I came to the same point: if a user sets a
very high value for the number of slots (say 1GB), then the number of
slots in each bank will be 1024 (considering the max number of banks,
128), and if we continue the sequential search for finding the buffer
for the page, that could be costly in such cases. But later, in one of
the conversations with Robert, I realized that we can have this
bank-wise lock approach along with a partitioned hash table.

So the idea is that we will use a buffer mapping hash table,
something like Thomas used in one of his patches [1], but instead of a
normal hash table we will use a partitioned hash table. The SLRU
buffer pool is still divided as in the bank-wise approach,
and there will be separate locks for each slot range. So now we get
the benefit of both approaches: 1) by having a mapping hash we
avoid the sequential search; 2) by dividing the buffer pool into banks
and keeping the victim buffer search within those banks, we avoid
locking all the partitions during the victim buffer search; 3) and we
can also maintain a bank-wise LRU counter so that we avoid contention
on a single variable, as discussed in the first email of this thread.
Please find the updated patch set details and patches attached to the
email.
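A minimal sketch of the combined scheme (illustrative names and sizes, not the patch): pageno picks both the partition of the buffer-mapping hash (hence the lock to take) and the bank in which a victim buffer would be searched, so neither a lookup nor an eviction touches more than one partition lock.

```c
#include <assert.h>

/*
 * Illustrative sketch, not the patch's code: the partition chosen by
 * pageno determines both the hash partition lock and the slot range
 * (bank) for the victim buffer search.
 */
#define SLRU_NUM_PARTITIONS 8

static inline int
slru_partition(int pageno)
{
    return pageno % SLRU_NUM_PARTITIONS;
}

static inline int
slru_victim_search_start(int pageno, int bank_size)
{
    return slru_partition(pageno) * bank_size;  /* first slot of the bank */
}
```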

[1]: 0001-Make-all-SLRU-buffer-sizes-configurable: Same patch as the previous patch set
[2]: 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce a buffer mapping hash table
[3]: 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and LRU counter.
[4]: 0004-Merge-partition-locks-array-with-buffer-locks-array: merging buffer locks and bank locks in the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
[5]: 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks

With this approach, I have also fixed the number of banks at a
constant (8) so that some of the computations are simpler. I think
with a buffer mapping hash table we should not have much problem
keeping this fixed: even with a very extreme configuration and a very
high number of slots we do not see performance problems, since we are
not doing a sequential search thanks to the buffer mapping hash, and
if the number of slots is set that high then the victim buffer search
should also be infrequent, so we need not worry about the sequential
search within a bank for victim buffer selection. I have also changed
the default number of slots to 64 and the minimum to 16. I think this
is a reasonable default because the existing values are too low for
modern hardware, and these parameters are configurable, so users can
set them lower when running with very little memory.
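The constraint added by patch 0005 can be sketched as follows (illustrative, assuming the fixed bank count of 8 mentioned above): the configured slot count is rounded up to a multiple of the bank count so that all banks have the same size.

```c
#include <assert.h>

/*
 * Illustrative sketch of the 0005 adjustment, not the patch's code:
 * round the configured slot count up to a multiple of the (fixed)
 * number of banks.
 */
#define SLRU_NBANKS 8

static inline int
slru_round_slots(int nslots)
{
    return ((nslots + SLRU_NBANKS - 1) / SLRU_NBANKS) * SLRU_NBANKS;
}
```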

[1]: /messages/by-id/CA+hUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-=pyg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0005-Ensure-slru-buffer-slots-are-in-multiple-of-numbe.patch (+106 -8)
v4-0001-Make-all-SLRU-buffer-sizes-configurable.patch (+299 -46)
v4-0003-Partition-wise-slru-locks.patch (+601 -255)
v4-0002-Add-a-buffer-mapping-table-for-SLRUs.patch (+123 -23)
v4-0004-Merge-partition-locks-array-with-buffer-locks-arr.patch (+69 -64)
#13Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#11)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks

There was nothing wrong with having too many banks, until bank-wise locks and counters were added in later patchsets.
Having a hashtable to find the SLRU page in the buffer is, IMV, too slow. Some comments on this approach can be found here [0].
I'm OK with having an HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used an HTAB in some ancient times, but I could not find the commit when it was changed to a linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit the performance of a local linear search. Lock partitions have to be bigger, for sure.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Thanks for working on this!

Best regards, Andrey Borodin.

[0]: /messages/by-id/CA+hUKGKVqrxOp82zER1=XN=yPwV_-OCGAg=ez=1iz9rG+A7Smw@mail.gmail.com

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#13)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks

There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we are removing
major contention from the centralized control lock, and my first email
shows how much benefit we can get in one simple test case where we
create subtransaction overflow.
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than a sequential search when banks are relatively
big, but if we find cases where having the hash creates more
overhead than it provides gain, then I am fine with removing it,
because the whole purpose of adding the hash is to make the lookup
faster. So far in my tests I have not found any slowness. Do you or
anyone else have a test case, based on the previous research, showing
whether it creates any slowness?

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean,
the bank-wise counter is fine, as we are finding a victim buffer within
a bank itself, but each lock could cover more slots than one bank, or
in other words, it could protect multiple banks. Let's hear more
opinions on this.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl; see the snippet below from the
original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

  SimpleLruInit(MultiXactOffsetCtl,
-   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+   "MultiXactOffset", multixact_offsets_buffers, 0,
    MultiXactOffsetSLRULock, "pg_multixact/offsets",
    LWTRANCHE_MULTIXACTOFFSET_BUFFER,
    SYNC_HANDLER_MULTIXACT_OFFSET);
  SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
  SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+   "MultiXactMember", multixact_offsets_buffers, 0,
    MultiXactMemberSLRULock, "pg_multixact/members",
    LWTRANCHE_MULTIXACTMEMBER_BUFFER,
    SYNC_HANDLER_MULTIXACT_MEMBER);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#14)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?

PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean
bank-wise counter is fine as we are finding a victim buffer within a
bank itself, but each lock could cover more slots than one bank size
or in other words, it can protect multiple banks. Let's hear more
opinion on this.

+1

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl, see the below patch snippet from the
original patch.

Ouch. We were running this for several years with this bug... Thanks!

Best regards, Andrey Borodin.

Attachments:

0001-Implement-benchmark_slru_page_readonly-to-assess-SLR.patch (+20 -1)
#16Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#15)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?

PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

With this test, I got the numbers below:

nslots  no-hash  hash
8       10s      13s
16      10s      13s
32      15s      13s
64      17s      13s

Yeah, so we can see that with a small bank size (<=16 slots) fetching a
page with the hash is ~30% slower than the sequential search, but
beyond 32 slots the sequential search becomes slower as you grow the
number of slots, whereas with the hash it stays constant, as expected.
But now, as you said, if we keep the lock partition range different
from the bank size, then we might not have any problem having more
banks, and with that we can keep the bank size small, like 16. Let me
put some more thought into this and get back.
Any other opinions on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#16)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 2023-Nov-06, Dilip Kumar wrote:

Yeah so we can see with a small bank size <=16 slots we are seeing
that the fetching page with hash is 30% slower than the sequential
search, but beyond 32 slots sequential search is become slower as you
grow the number of slots whereas with hash it stays constant as
expected. But now as you told if keep the lock partition range
different than the bank size then we might not have any problem by
having more numbers of banks and with that, we can keep the bank size
small like 16. Let me put some more thought into this and get back.
Any other opinions on this?

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

#18Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#17)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning; simplehash has not, AFAIK.
The thing is, we do not really need a hash function -- pageno is already the best hash function itself. And we do not need to cope with collisions much -- we can evict a collided buffer.

Given this, we do not need a hashtable at all. That’s exactly the reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the “banks” approach is cleaner. However, the term “bank” is not common in software; it’s taken from hardware caches.
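A minimal sketch of that bank lookup (illustrative sizes, not the real slru.c code): pageno is its own hash, selecting a bank; the search is a short linear scan within that bank only, and a miss simply means a buffer in the same bank gets evicted.

```c
#include <assert.h>

/*
 * Illustrative bank lookup without any hash table: pageno selects a
 * bank, the scan is confined to that bank, and -1 on a miss signals
 * that a slot within the same bank should be evicted.
 */
#define NBANKS    8
#define BANK_SIZE 4

static int
slru_lookup(const int page_numbers[NBANKS * BANK_SIZE], int pageno)
{
    int start = (pageno % NBANKS) * BANK_SIZE;

    for (int slot = start; slot < start + BANK_SIZE; slot++)
    {
        if (page_numbers[slot] == pageno)
            return slot;        /* hit */
    }
    return -1;                  /* miss: evict within this bank */
}
```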

Best regards, Andrey Borodin.

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#18)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.

Yeah, simplehash doesn't have partitioning, so with simplehash we would
be stuck with the centralized control lock, which is one of the main
problems we are trying to solve here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Amul Sul
sulamul@gmail.com
In reply to: Andrey Borodin (#18)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a
best hash function itself. And we do not need to cope with collisions much
- we can evict a collided buffer.

Given this we do not need a hashtable at all. That’s exact reasoning how
banks emerged, I started implementing dynahsh patch in April 2021 and found
out that “banks” approach is cleaner. However the term “bank” is not common
in software, it’s taken from hardware cache.

I agree that we don't need the hash function to generate a hash value
out of pageno, which itself is sufficient, but I don't understand how
we can get rid of the hash table itself -- how would we map the pageno
to the slot number? Is that mapping not needed at all?

Regards,
Amul

#21Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#12)
#22Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#21)
#23Ants Aasma
ants.aasma@cybertec.at
In reply to: Andrey Borodin (#13)
#24Andrey Borodin
amborodin@acm.org
In reply to: Ants Aasma (#23)
#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#14)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ants Aasma (#23)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#25)
#28Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#27)
#29Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#26)
#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Dilip Kumar (#29)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#25)
#32Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#31)
#33Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#32)
#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#32)
#36Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#34)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#36)
#38Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#33)
#39Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#38)
#40Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#35)
#41Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#40)
#42Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#41)
#43Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#42)
#44Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#43)
#45Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#44)
#46Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#45)
#47Tender Wang
tndrwang@gmail.com
In reply to: Dilip Kumar (#46)
#48Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tender Wang (#47)
#49Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#48)
#50Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#49)
#51Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#50)
#52Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#51)
#53Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#52)
#54Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#52)
#55Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#51)
#56Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#55)
#57Andrey Borodin
amborodin@acm.org
In reply to: Amul Sul (#55)
#58Tender Wang
tndrwang@gmail.com
In reply to: Andrey Borodin (#57)
#59Andrey Borodin
amborodin@acm.org