SLRU optimization - configurable buffer pool and partitioning the SLRU lock

Started by Dilip Kumar over 2 years ago · 131 messages
#1 Dilip Kumar
dilipbalaut@gmail.com

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There are a couple of patches proposed in the past to address the
problem of increasing the buffer pool size; one of the patches [1] was
proposed by Thomas Munro, where we make the size of the buffer pool
configurable. And, in order to deal with the linear search in the
large buffer pool, we divide the SLRU buffer pool into associative
banks so that searching in the buffer pool doesn’t get affected by the
large size of the buffer pool. This does well for the workloads which
are mainly impacted by the frequent buffer replacement but this still
doesn’t stand well with the workloads where the centralized control
lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and added 2 more
improvements on top of it: 1) in v1-0002, instead of a centralized
control lock for the SLRU I have introduced a bank-wise control lock;
2) in v1-0003, I have removed the global LRU counter and introduced a
bank-wise counter. The second change (v1-0003) is in order to avoid
the CPU/OS cache invalidation caused by frequent updates of a single
shared variable; later in my performance tests I will show how much
gain we get because of these 2 changes.

Note: This is going to be a long email, but I have summarised the main
idea above this point, and now I am going to discuss more internal
details in order to show that the design idea is valid. I am also
going to show 2 performance tests, where one is specific to contention
on the centralized lock and the other is mainly contention due to
frequent buffer replacement in the SLRU buffer pool. We are getting
~2x TPS compared to head with these patches, and in later sections I
will discuss this in more detail, i.e. the exact performance numbers
and an analysis of why we are seeing the gain.

There are some problems I faced while converting this centralized
control lock to a bank-wise lock, mainly because this lock is
(mis)used for different purposes. The main purpose of this control
lock, as I understand it, is to protect in-memory access (read/write)
of the buffers in the SLRU buffer pool.

Here is the list of some problems and their analysis:

1) In some of the SLRUs, we use this lock for protecting members
inside the control structure which are specific to that SLRU layer,
i.e. the SerialControlData members are protected by SerialSLRULock. I
don't think that is the right use of this lock, so for this purpose I
have introduced another lock called SerialControlLock. Based on my
analysis there is no reason for protecting these members and the SLRU
buffer access with the same lock.
2) The member called 'latest_page_number' inside SlruSharedData is
also protected by the SLRULock. I would not say this use case is
wrong, but since this is a common variable and not a per-bank variable
it cannot be protected by a bank-wise lock anymore. The only usage of
this variable is to track the latest page in an SLRU so that we do not
evict the latest page during victim page selection. So I have
converted this to an atomic variable, as it is completely independent
of the SLRU buffer access.
3) SlruScanDirectory() is called under the SLRU control lock from
DeactivateCommitTs(), but from all other places SlruScanDirectory() is
called without the lock, because those callers run from places that
are not executed concurrently (i.e. startup, checkpoint).
DeactivateCommitTs() is also called only from startup, so there
doesn't seem to be any need to call this under the SLRU control lock.
Currently, I have called this under the all-bank lock because this is
not a performance path, and that way we keep it consistent with the
current logic; but if others also think that we do not need a lock at
this place, then we could remove it, and then we wouldn't need the
all-bank lock anywhere.

There are some other uses of this lock where one might think dividing
it into a bank-wise lock would be a problem, but it is not, and I have
given my analysis for each below.

1) SimpleLruTruncate: We might worry that converting to a bank-wise
lock could be an issue here, as we might need to release and acquire
different locks as we scan different banks. But as per my analysis
this is not an issue, because a) with the current code we also release
and acquire the centralized lock multiple times in order to perform
the I/O on the buffer slot, so the behavior is not changed; and, most
importantly, b) all SLRU layers take care that this function is not
accessed concurrently. I have verified all callers of this function
and this is true, and the function header says the same. So this is
not an issue as per my analysis.

2) Extending or adding a new page to an SLRU: I have noticed that this
is also protected either by some other exclusive lock or only done
during startup. So in short, the SLRULock is just used for protecting
access to the buffers in the buffer pool; it is not what guarantees
exclusive execution of these functions, because that is taken care of
in some other way.

3) Another thing I noticed while writing this, which is worth noting
here: the CLOG group update of xid status. There, if we do not get the
control lock on the SLRU, we add ourselves to a group and the group
leader does the job for all members of the group. One might think that
different pages in the group could belong to different SLRU banks, so
the leader might need to acquire/release the lock as it processes the
requests in the group. Yes, that is true, and it is taken care of; but
we don't need to worry much about this case, because as per the
implementation of the group update we try to put members with the same
page request into one group, and only as an exception could there be
members with a different page request. So the point is that with a
bank-wise lock we handle that exception case, but it is not the
regular case that we need to acquire/release multiple times. So
design-wise we are good, and performance-wise there should not be any
problem, because most of the time we will be updating pages from the
same bank; and if in some cases we have updates for old transactions
due to long-running transactions, then we should do better by not
having a centralized lock.

Performance Test:
Exp1: Shows the problem of CPU/OS cache invalidation due to frequent
updates of the centralized lock and a common LRU counter. Here, in
parallel with the pgbench script, we run a transaction that frequently
creates subtransaction overflow, which forces the visibility-check
mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 GB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts in parallel:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (create subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1);
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied, which converts the SLRU buffer pool
into banks (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731

So we can see that at 128 clients we get ~2x TPS (with SlruBank +
bank-wise lock and bank-wise LRU counter) compared to head. One might
wonder why we do not see much gain with only the SlruBank patch. The
reason is that in this particular test case we are not seeing much
load on buffer replacement. In fact, the wait events also don't show
contention on any lock; instead the main load is due to frequently
modifying common variables like the centralized control lock and the
centralized LRU counter. That is evident in the perf data shown below.

+  74.72%   0.06% postgres postgres  [.] XidInMVCCSnapshot
+  74.08%   0.02% postgres postgres  [.] SubTransGetTopmostTransaction
+  74.04%   0.07% postgres postgres  [.] SubTransGetParent
+  57.66%   0.04% postgres postgres  [.] LWLockAcquire
+  57.64%   0.26% postgres postgres  [.] SimpleLruReadPage_ReadOnly
……
+  16.53%   0.07% postgres postgres  [.] LWLockRelease
+  16.36%   0.04% postgres postgres  [.] pg_atomic_sub_fetch_u32
+  16.31%  16.24% postgres postgres  [.] pg_atomic_fetch_sub_u32_impl

We can notice that the main load is on the atomic variable within
LWLockAcquire and LWLockRelease. Once we apply the bank-wise lock
patch (v1-0002), the same problem becomes visible on the cur_lru_count
update in the SlruRecentlyUsed macro [2] (I have not shown that here
but it was visible in my perf report). And that is resolved by
implementing a bank-wise counter.

[2]:
#define SlruRecentlyUsed(shared, slotno) \
do { \
..
(shared)->cur_lru_count = ++new_lru_count; \
..
} \
} while (0)

Exp2: This test shows the load of frequent SLRU buffer replacement. In
this test we run a pgbench-like script which frequently generates
multixact ids, and in parallel we repeatedly start and commit a
long-running transaction, so that the multixact ids are not
immediately cleaned up by vacuum and we create contention on the SLRU
buffer pool. I am not leaving the long-running transaction running
forever, as that would start to show another problem with respect to
bloat, and we would lose the purpose of what I am trying to show here.

Note: the test configuration is the same as Exp1; just the workload is
different, and we run the 2 scripts below. There is also the new
config parameter (added in v1-0001) slru_buffers_size_scale=4, which
means NUM_MULTIXACTOFFSET_BUFFERS will be 64 (16 in head) and
NUM_MULTIXACTMEMBER_BUFFERS will be 128 (32 in head).

./pgbench -c $ -j $ -T 600 -P5 -M prepared -f multixact.sql postgres
./pgbench -c 1 -j 1 -T 600 -f longrunning.sql postgres

cat > multixact.sql <<EOF
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
SELECT FROM pgbench_accounts WHERE aid = :aid FOR UPDATE;
SAVEPOINT S1;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
EOF

cat > longrunning.sql << EOF
BEGIN;
INSERT INTO pgbench_test VALUES(1);
select pg_sleep(10);
COMMIT;
EOF

Results:
Clients Head SlruBank SlruBank+BankwiseLock
1 528 513 531
8 3870 4239 4157
32 13945 14470 14556
64 10086 19034 24482
128 6909 15627 18161

Here we can see a good improvement with the SlruBank patch itself,
because of the increased SLRU buffer pool size, as in this workload
there is a lot of contention due to buffer replacement. As shown
below, we can see a lot of load on MultiXactOffsetSLRU as well as on
MultiXactOffsetBuffer, which shows there are frequent buffer evictions
in this workload. Increasing the SLRU buffer pool size helps a lot,
and by further dividing the SLRU lock into bank-wise locks we see a
further gain. So in total, we are seeing ~2.5x TPS at 64 and 128
threads compared to head.

3401 LWLock | MultiXactOffsetSLRU
2031 LWLock | MultiXactOffsetBuffer
687 |
427 LWLock | BufferContent

Credits:
- The base patch v1-0001 is authored by Thomas Munro and I have just rebased it.
- 0002 and 0003 are new patches written by me, based on design ideas
from Robert and myself.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0003-Introduce-bank-wise-LRU-counter.patch (+32 −20)
v1-0001-Divide-SLRU-buffers-into-banks.patch (+117 −49)
v1-0002-bank-wise-slru-locks.patch (+452 −199)
#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#1)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There is a couple of patches proposed in the past to address the
problem of increasing the buffer pool size, one of the patch [1] was
proposed by Thomas Munro where we make the size of the buffer pool
configurable.

In my last email, I forgot to give the link from where I have taken
the base patch for dividing the buffer pool into banks, so giving the
same here [1]. And looking at this again, it seems that the idea of
that patch was from Andrey M. Borodin, and the idea of the SLRU scale
factor was introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

[1]: https://commitfest.postgresql.org/43/2627/

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#2)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from
Andrey M. Borodin and the idea of the SLRU scale factor were
introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing
that in the first email.

[1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch. Now, while reading
through that patch, I realized that some refactoring was needed and
there were some unused functions, so I have removed those and also
added some comments. I also did some refactoring of my own patches, so
reposting the patch series.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0002-bank-wise-slru-locks.patch (+482 −207)
v2-0003-Introduce-bank-wise-LRU-counter.patch (+32 −20)
v2-0001-Divide-SLRU-buffers-into-banks.patch (+119 −51)
#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#1)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 11, 2023 at 4:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The small size of the SLRU buffer pools can sometimes become a
performance problem because it’s not difficult to have a workload
where the number of buffers actively in use is larger than the
fixed-size buffer pool. However, just increasing the size of the
buffer pool doesn’t necessarily help, because the linear search that
we use for buffer replacement doesn’t scale, and also because
contention on the single centralized lock limits scalability.

There is a couple of patches proposed in the past to address the
problem of increasing the buffer pool size, one of the patch [1] was
proposed by Thomas Munro where we make the size of the buffer pool
configurable. And, in order to deal with the linear search in the
large buffer pool, we divide the SLRU buffer pool into associative
banks so that searching in the buffer pool doesn’t get affected by the
large size of the buffer pool. This does well for the workloads which
are mainly impacted by the frequent buffer replacement but this still
doesn’t stand well with the workloads where the centralized control
lock is the bottleneck.

So I have taken this patch as my base patch (v1-0001) and further
added 2 more improvements to this 1) In v1-0002, Instead of a
centralized control lock for the SLRU I have introduced a bank-wise
control lock 2)In v1-0003, I have removed the global LRU counter and
introduced a bank-wise counter. The second change (v1-0003) is in
order to avoid the CPU/OS cache invalidation due to frequent updates
of the single variable, later in my performance test I will show how
much gain we have gotten because of these 2 changes.

Note: This is going to be a long email but I have summarised the main
idea above this point and now I am going to discuss more internal
information in order to show that the design idea is valid and also
going to show 2 performance tests where one is specific to the
contention on the centralized lock and other is mainly contention due
to frequent buffer replacement in SLRU buffer pool. We are getting ~2x
TPS compared to the head by these patches and in later sections, I am
going discuss this in more detail i.e. exact performance numbers and
analysis of why we are seeing the gain.

...

Performance Test:
Exp1: Show problems due to CPU/OS cache invalidation due to frequent
updates of the centralized lock and a common LRU counter. So here we
are running a parallel transaction to pgbench script which frequently
creates subtransaction overflow and that forces the visibility-check
mechanism to access the subtrans SLRU.
Test machine: 8 CPU/ 64 core/ 128 with HT/ 512 MB RAM / SSD
scale factor: 300
shared_buffers=20GB
checkpoint_timeout=40min
max_wal_size=20GB
max_connections=200

Workload: Run these 2 scripts parallelly:
./pgbench -c $ -j $ -T 600 -P5 -M prepared postgres
./pgbench -c 1 -j 1 -T 600 -f savepoint.sql postgres

savepoint.sql (create subtransaction overflow)
BEGIN;
SAVEPOINT S1;
INSERT INTO test VALUES(1)
← repeat 70 times →
SELECT pg_sleep(1);
COMMIT;

Code under test:
Head: PostgreSQL head code
SlruBank: The first patch applied to convert the SLRU buffer pool into
the bank (0001)
SlruBank+BankwiseLockAndLru: Applied 0001+0002+0003

Results:
Clients Head SlruBank SlruBank+BankwiseLockAndLru
1 457 491 475
8 3753 3819 3782
32 14594 14328 17028
64 15600 16243 25944
128 15957 16272 31731

So we can see that at 128 clients, we get ~2x TPS(with SlruBank +
BankwiseLock and bankwise LRU counter) as compared to HEAD.

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage? BTW, I remember
that there was also a discussion of moving SLRU into the regular
buffer pool [1]. You have not provided any explanation as to whether
that approach will have any merit after we do this, or whether that
approach is not worth pursuing at all.

[1]: https://commitfest.postgresql.org/43/3514/

--
With Regards,
Amit Kapila.

#5 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#4)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question; can you explain it a bit? In
short, if it is regarding the performance, then we will see it for all
the SLRUs, as the control lock is not centralized anymore and is
instead a bank-wise lock.

BTW, I remember

that there was also a discussion of moving SLRU into a regular buffer
pool [1]. You have not provided any explanation as to whether that
approach will have any merits after we do this or whether that
approach is not worth pursuing at all.

[1] - https://commitfest.postgresql.org/43/3514/

Yeah, I haven't read that thread in detail, about the performance
numbers and all. But both of these cannot coexist, because this patch
is improving the SLRU buffer pool access/configurable size and also
the lock contention. If we move SLRU to the main buffer pool then we
might not have a similar problem; instead there might be other
problems, like SLRU buffers getting swapped out due to other relation
buffers. OTOH, the advantage of that approach would be that we can
just use a bigger buffer pool and SLRU can also take advantage of
that. But in my opinion, most of the time we have limited page access
in SLRU, and the SLRU buffer access pattern is also quite different
from the relation page access pattern, so keeping them under the same
buffer pool and competing against relation pages for victim buffer
selection might cause different problems. But anyway, I would discuss
those points in that thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#2)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 2023-Oct-11, Dilip Kumar wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from Andrey M. Borodin and the idea of the SLRU scale factor
were introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

You mean [1].

[1]: /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru

I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.

Like everybody else, I like having fewer GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.

I'm inclined to use Borodin's patch last posted here [2] instead of
your proposed 0001.

[2]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail;
they look like reasonable ideas. I have not tried to reproduce the
claimed performance benefits. I think measuring this patch set with
the tests posted by Shawn Debnath in [3] is important, too.

[3]: /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
-- Paul Graham, http://www.paulgraham.com/opensource.html

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
use get rid of all this complexity.

+1

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

I think it's worth a try. It does seem simpler, and it might allow us
to sidestep some concerns about scaling when the SLRU pages are in
shared_buffers [0].

[0]: /messages/by-id/ZPsaEGRvllitxB3v@tamriel.snowman.net

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#8 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#6)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Oct-11, Dilip Kumar wrote:

In my last email, I forgot to give the link from where I have taken
the base path for dividing the buffer pool in banks so giving the same
here[1]. And looking at this again it seems that the idea of that
patch was from Andrey M. Borodin and the idea of the SLRU scale factor
were introduced by Yura Sokolov and Ivan Lazarev. Apologies for
missing that in the first email.

You mean [1].
[1] /messages/by-id/452d01f7e331458f56ad79bef537c31b@postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values. In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range. The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires. This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory. This doesn't sound sensible
to me.

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me. IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
use get rid of all this complexity.

Overall I agree with your comments. Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. So yeah, I think it is better that
we take the older patch version, where we have a separate GUC per
SLRU, as our base patch.

I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

I did skim patches 0002 and 0003 without going into too much detail;
they look like reasonable ideas. I have not tried to reproduce the claimed
performance benefits. I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3] /messages/by-id/YemDdpMrsoJFQJnU@f01898859afd.ant.amazon.com

Thanks for taking a look.

On the other hand, here's a somewhat crazy idea. What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, though the
computation depends on NBuffers. I mean, for each SLRU we are computing
separate memory that is additional to shared_buffers, no?

we allocate a common pool for all SLRUs to use? We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest. The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs will
share a single buffer pool, and the LRU algorithm will decide which
pages stay in the pool and which get evicted? But wouldn't that create
another issue, with different SLRUs starting to contend on the same lock
if we have a common buffer pool for all of them? Or am I missing
something? Or are you saying that although there is a common buffer
pool, each SLRU will have its own boundaries within it, protected by a
separate lock, and based on the workload those boundaries can change
dynamically? I haven't put much thought into how practical the idea is;
I'm just trying to understand what you have in mind.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#5)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question can you explain this a bit?

I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?

--
With Regards,
Amit Kapila.

#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#9)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 25, 2023 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 20, 2023 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Oct 14, 2023 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

This and other results shared by you look promising. Will there be any
improvement in workloads related to clog buffer usage?

I did not understand this question can you explain this a bit?

I meant to ask about the impact of this patch on accessing transaction
status via TransactionIdGetStatus(). Shouldn't we expect some
improvement in accessing CLOG buffers?

Yes, there should be, because 1) there is no longer a common lock, so
contention on the centralized control lock is reduced when we access
transaction status from pages falling in different SLRU banks; 2) the
buffer size is configurable, so if the workload accesses transaction
status across a wide range of transactions it helps with frequent
buffer eviction, though that might not be the most common case.
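The first point can be sketched as follows (illustrative constants and names, not the patch's code): pages in different banks map to different locks, so concurrent TransactionIdGetStatus() calls on unrelated pages no longer serialize on one central lock.

```c
#include <assert.h>

/*
 * Sketch of bank-wise lock selection (illustrative, not the patch's
 * code): each page maps to one of NBANKS banks, and each bank has its
 * own control lock, so lookups on pages in different banks take
 * different locks.
 */
#define NBANKS 8

static inline int
slru_bank_for_page(int pageno)
{
    return pageno % NBANKS;     /* index of the bank (and its lock) */
}
```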

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#8)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Wed, Oct 25, 2023 at 10:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Overall I agree with your comments, actually, I haven't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter. Yeah, so I think it is better
that we take the older patch version as our base patch where we have
separate GUC per SLRU.

I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

I will rebase my patches on top of this.

I have taken 0001 and 0002 from [1], done some bug fixes in 0001, and
changed the logic of SlruAdjustNSlots() in 0002, so that it now starts
with the next power of 2 of the configured slots and keeps doubling the
number of banks until we either reach the maximum number of banks,
SLRU_MAX_BANKS (128), or the bank size would drop below
SLRU_MIN_BANK_SIZE (8). By doing so, we ensure we don't have too
many banks, but also that we don't have very large banks. There was
also a patch 0003 in this thread, but I haven't taken it, as it is
another optimization (merging some structure members); I will analyze
its performance characteristics and try to add it on top of the
complete patch series.

Patch details:
0001 - GUC parameter for each SLRU
0002 - Divide the SLRU pool into banks
(The above 2 are taken from [1] with some modifications and rebasing by me)
0003 - Implement bank-wise SLRU lock as described in the first email
of this thread
0004 - Implement bank-wise LRU counter as described in the first email
of this thread
0005 - Some other optimization suggested offlist by Alvaro, i.e.
merging buffer locks and bank locks in the same array so that the
bank-wise LRU counter does not fetch the next cache line in a hot
function SlruRecentlyUsed()

Note: I think 0003,0004 and 0005 can be merged together but kept
separate so that we can review them independently and see how useful
each of them is.

[1]: /messages/by-id/93236D36-B91C-4DFA-AF03-99C083840378@yandex-team.ru

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0005-Merge-bank-locks-array-with-buffer-locks-array.patch (+72 -67)
v3-0002-Divide-SLRU-buffers-into-banks.patch (+75 -5)
v3-0004-Introduce-bank-wise-LRU-counter.patch (+64 -48)
v3-0001-Make-all-SLRU-buffer-sizes-configurable.patch (+298 -45)
v3-0003-Bank-wise-slru-locks.patch (+494 -211)
#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#11)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on some offlist discussions with Alvaro and Robert in separate
conversations, Alvaro and I came to the same point: if a user sets a
very high value for the number of slots (say 1GB), then the number of
slots in each bank will be 1024 (considering the max number of banks,
128), and if we continue the sequential search for finding the buffer
for the page, that could be costly in such cases. But later, in one of
the conversations with Robert, I realized that we can have this
bank-wise lock approach along with a partitioned hash table.

So the idea is that we will use a buffer mapping hash table,
something like Thomas used in one of his patches [1], but instead of a
normal hash table we will use a partitioned hash table. The SLRU
buffer pool is still divided as in the bank-wise approach,
and there will be separate locks for each slot range. So now we get
the benefit of both approaches: 1) by having a mapping hash we
avoid the sequential search; 2) by dividing the buffer pool into banks
and keeping the victim buffer search within those banks, we avoid
locking all the partitions during the victim buffer search; 3) and we
can also maintain a bank-wise LRU counter so that we avoid contention
on a single variable, as discussed in the first email of this thread.
Please find the updated patch set details and patches attached to the
email.
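A minimal sketch of the combined scheme (illustrative names and sizes, not the patch): pageno picks both the partition of the buffer-mapping hash (hence the lock to take) and the bank in which a victim buffer would be searched, so neither a lookup nor an eviction touches more than one partition lock.

```c
#include <assert.h>

/*
 * Illustrative sketch, not the patch's code: the partition chosen by
 * pageno determines both the hash partition lock and the slot range
 * (bank) for the victim buffer search.
 */
#define SLRU_NUM_PARTITIONS 8

static inline int
slru_partition(int pageno)
{
    return pageno % SLRU_NUM_PARTITIONS;
}

static inline int
slru_victim_search_start(int pageno, int bank_size)
{
    return slru_partition(pageno) * bank_size;  /* first slot of the bank */
}
```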

[1]: 0001-Make-all-SLRU-buffer-sizes-configurable: Same patch as the previous patch set
[2]: 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce a buffer mapping hash table
[3]: 0003-Partition-wise-slru-locks: Partition the hash table and also introduce partition-wise locks: this is a merge of 0003 and 0004 from the previous patch set, but instead of bank-wise locks it has partition-wise locks and LRU counter.
[4]: 0004-Merge-partition-locks-array-with-buffer-locks-array: merging buffer locks and bank locks in the same array so that the bank-wise LRU counter does not fetch the next cache line in the hot function SlruRecentlyUsed() (same as 0005 from the previous patch set)
[5]: 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure that the number of slots is a multiple of the number of banks

With this approach, I have also fixed the number of banks at a
constant (8) so that some of the computations are simpler. I think
with a buffer mapping hash table we should not have much problem
keeping this fixed: even with a very extreme configuration and a very
high number of slots we do not see performance problems, since we are
not doing a sequential search thanks to the buffer mapping hash, and
if the number of slots is set that high then the victim buffer search
should also be infrequent, so we need not worry about the sequential
search within a bank for victim buffer selection. I have also changed
the default number of slots to 64 and the minimum to 16. I think this
is a reasonable default because the existing values are too low for
modern hardware, and these parameters are configurable, so users can
set them lower when running with very little memory.
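The constraint added by patch 0005 can be sketched as follows (illustrative, assuming the fixed bank count of 8 mentioned above): the configured slot count is rounded up to a multiple of the bank count so that all banks have the same size.

```c
#include <assert.h>

/*
 * Illustrative sketch of the 0005 adjustment, not the patch's code:
 * round the configured slot count up to a multiple of the (fixed)
 * number of banks.
 */
#define SLRU_NBANKS 8

static inline int
slru_round_slots(int nslots)
{
    return ((nslots + SLRU_NBANKS - 1) / SLRU_NBANKS) * SLRU_NBANKS;
}
```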

[1]: /messages/by-id/CA+hUKGLCLDtgDj2Xsf0uBk5WXDCeHxBDDJPsyY7m65Fde-=pyg@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0005-Ensure-slru-buffer-slots-are-in-multiple-of-numbe.patch (+106 -8)
v4-0001-Make-all-SLRU-buffer-sizes-configurable.patch (+299 -46)
v4-0003-Partition-wise-slru-locks.patch (+601 -255)
v4-0002-Add-a-buffer-mapping-table-for-SLRUs.patch (+123 -23)
v4-0004-Merge-partition-locks-array-with-buffer-locks-arr.patch (+69 -64)
#13Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#11)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks

There was nothing wrong with having too many banks, until bank-wise locks and counters were added in later patchsets.
Having a hashtable to find the SLRU page in the buffer is, IMV, too slow. Some comments on this approach can be found here [0].
I'm OK with having an HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used an HTAB in some ancient times, but I could not find the commit when it was changed to a linear search.

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit the performance of a local linear search. Lock partitions have to be bigger, for sure.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Thanks for working on this!

Best regards, Andrey Borodin.

[0]: /messages/by-id/CA+hUKGKVqrxOp82zER1=XN=yPwV_-OCGAg=ez=1iz9rG+A7Smw@mail.gmail.com

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#13)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

changed the logic of SlruAdjustNSlots() in 0002, such that now it
starts with the next power of 2 value of the configured slots and
keeps doubling the number of banks until we reach the number of banks
to the max SLRU_MAX_BANKS(128) and bank size is bigger than
SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
many banks

There was nothing wrong with having too many banks. Until bank-wise locks and counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we are removing
major contention from the centralized control lock, and my first email
shows how much benefit we can get in one simple test case where we
create subtransaction overflow.
Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than a sequential search when banks are relatively
big, but if we find cases where having the hash creates more
overhead than it provides gain, then I am fine with removing it,
because the whole purpose of adding the hash is to make the lookup
faster. So far in my tests I have not found any slowness. Do you or
anyone else have a test case, based on the previous research, showing
whether it creates any slowness?

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean,
the bank-wise counter is fine, as we are finding a victim buffer within
a bank itself, but each lock could cover more slots than one bank, or
in other words, it could protect multiple banks. Let's hear more
opinions on this.

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl; see the snippet below from the
original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

  SimpleLruInit(MultiXactOffsetCtl,
-   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+   "MultiXactOffset", multixact_offsets_buffers, 0,
    MultiXactOffsetSLRULock, "pg_multixact/offsets",
    LWTRANCHE_MULTIXACTOFFSET_BUFFER,
    SYNC_HANDLER_MULTIXACT_OFFSET);
  SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
  SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+   "MultiXactMember", multixact_offsets_buffers, 0,
    MultiXactMemberSLRULock, "pg_multixact/members",
    LWTRANCHE_MULTIXACTMEMBER_BUFFER,
    SYNC_HANDLER_MULTIXACT_MEMBER);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#14)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?

PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

Maybe we could decouple locks and counters from SLRU banks? Banks were meant to be small to exploit performance of local linear search. Lock partitions have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash. I mean
bank-wise counter is fine as we are finding a victim buffer within a
bank itself, but each lock could cover more slots than one bank size
or in other words, it can protect multiple banks. Let's hear more
opinion on this.

+1

On 30 Oct 2023, at 09:20, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have taken 0001 and 0002 from [1], done some bug fixes in 0001

BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl, see the below patch snippet from the
original patch.

Ouch. We were running this for several years with this bug... Thanks!

Best regards, Andrey Borodin.

Attachments:

0001-Implement-benchmark_slru_page_readonly-to-assess-SLR.patch (+20 -1)
#16Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#15)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 1:05 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 6 Nov 2023, at 09:09, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Having hashtable to find SLRU page in the buffer IMV is too slow. Some comments on this approach can be found here [0].
I'm OK with having HTAB for that if we are sure performance does not degrade significantly, but I really doubt this is the case.
I even think SLRU buffers used HTAB in some ancient times, but I could not find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than sequence search when banks are relatively bigger
in size, but if we find the cases where having hash creates more
overhead than providing gain then I am fine to remove the hash because
the whole purpose of adding hash here to make the lookup faster. So
far in my test I did not find the slowness. Do you or anyone else
have any test case based on the previous research on whether it
creates any slowness?

PFA test benchmark_slru_page_readonly(). In this test we run SimpleLruReadPage_ReadOnly() (essential part of TransactionIdGetStatus())
before introducing HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
with buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; ./initdb test && echo "shared_preload_libraries = 'test_slru'">> test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT count(test_slru_page_write(a, 'Test SLRU'))
FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT benchmark_slru_page_readonly(12377);" postgres)

With this test, I got the numbers below:

nslots  no-hash  hash
8       10s      13s
16      10s      13s
32      15s      13s
64      17s      13s

Yeah, so we can see that with a small bank size (<=16 slots) fetching a
page with the hash is ~30% slower than the sequential search, but
beyond 32 slots the sequential search becomes slower as you grow the
number of slots, whereas with the hash it stays constant, as expected.
But now, as you said, if we keep the lock partition range different
from the bank size, then we might not have any problem having more
banks, and with that we can keep the bank size small, like 16. Let me
put some more thought into this and get back.
Any other opinions on this?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#16)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 2023-Nov-06, Dilip Kumar wrote:

Yeah so we can see with a small bank size <=16 slots we are seeing
that the fetching page with hash is 30% slower than the sequential
search, but beyond 32 slots sequential search is become slower as you
grow the number of slots whereas with hash it stays constant as
expected. But now as you told if keep the lock partition range
different than the bank size then we might not have any problem by
having more numbers of banks and with that, we can keep the bank size
small like 16. Let me put some more thought into this and get back.
Any other opinions on this?

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

#18Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#17)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning; simplehash has not, AFAIK.
The thing is, we do not really need a hash function -- pageno is already the best hash function itself. And we do not need to cope with collisions much -- we can evict a collided buffer.

Given this, we do not need a hashtable at all. That’s exactly the reasoning from which banks emerged: I started implementing a dynahash patch in April 2021 and found out that the “banks” approach is cleaner. However, the term “bank” is not common in software; it’s taken from hardware caches.
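A minimal sketch of that bank lookup (illustrative sizes, not the real slru.c code): pageno is its own hash, selecting a bank; the search is a short linear scan within that bank only, and a miss simply means a buffer in the same bank gets evicted.

```c
#include <assert.h>

/*
 * Illustrative bank lookup without any hash table: pageno selects a
 * bank, the scan is confined to that bank, and -1 on a miss signals
 * that a slot within the same bank should be evicted.
 */
#define NBANKS    8
#define BANK_SIZE 4

static int
slru_lookup(const int page_numbers[NBANKS * BANK_SIZE], int pageno)
{
    int start = (pageno % NBANKS) * BANK_SIZE;

    for (int slot = start; slot < start + BANK_SIZE; slot++)
    {
        if (page_numbers[slot] == pageno)
            return slot;        /* hit */
    }
    return -1;                  /* miss: evict within this bank */
}
```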

Best regards, Andrey Borodin.

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#18)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.

Yeah, simplehash doesn't have partitioning, so with simplehash we would
be stuck with the centralized control lock, which is one of the main
problems we are trying to solve here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Amul Sul
sulamul@gmail.com
In reply to: Andrey Borodin (#18)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:

On 6 Nov 2023, at 14:31, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

dynahash is notoriously slow, which is why we have simplehash.h since
commit b30d3ea824c5. Maybe we could use that instead.

Dynahash has lock partitioning. Simplehash has not, AFAIK.
The thing is we do not really need a hash function - pageno is already a
best hash function itself. And we do not need to cope with collisions much
- we can evict a collided buffer.

Given this we do not need a hashtable at all. That’s exact reasoning how
banks emerged, I started implementing dynahsh patch in April 2021 and found
out that “banks” approach is cleaner. However the term “bank” is not common
in software, it’s taken from hardware cache.

I agree that we don't need the hash function to generate a hash value
out of pageno, which itself is sufficient, but I don't understand how
we can get rid of the hash table itself -- how would we map the pageno
to the slot number? Is that mapping not needed at all?

Regards,
Amul

#21Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#12)
#22Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#21)
#23Ants Aasma
ants.aasma@cybertec.at
In reply to: Andrey Borodin (#13)
#24Andrey Borodin
amborodin@acm.org
In reply to: Ants Aasma (#23)
#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#14)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ants Aasma (#23)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#25)
#28Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#27)
#29Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#26)
#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Dilip Kumar (#29)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#25)
#32Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#31)
#33Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#32)
#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#32)
#36Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#34)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#36)
#38Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#33)
#39Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#38)
#40Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#35)
#41Andrey Borodin
amborodin@acm.org
In reply to: Dilip Kumar (#40)
#42Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andrey Borodin (#41)
#43Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#42)
#44Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#43)
#45Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#44)
#46Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#45)
#47Tender Wang
tndrwang@gmail.com
In reply to: Dilip Kumar (#46)
#48Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tender Wang (#47)
#49Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#48)
#50Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#49)
#51Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#50)
#52Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dilip Kumar (#51)
#53Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#52)
#54Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#52)
#55Amul Sul
sulamul@gmail.com
In reply to: Dilip Kumar (#51)
#56Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amul Sul (#55)
#57Andrey Borodin
amborodin@acm.org
In reply to: Amul Sul (#55)
#58Tender Wang
tndrwang@gmail.com
In reply to: Andrey Borodin (#57)
#59Andrey Borodin
amborodin@acm.org