Thoughts about NUM_BUFFER_PARTITIONS

Started by wenhui qiu almost 2 years ago; 14 messages
#1 wenhui qiu
qiuwenhuifx@gmail.com

Hi hackers,
While reading the documentation at
https://www.interdb.jp/pg/pgsql08/03.html I came across this paragraph:
/*
The BufMappingLock is split into partitions to reduce contention in the
buffer table (the default is 128 partitions). Each BufMappingLock partition
guards a portion of the corresponding hash bucket slots.
*/

Physical servers with terabytes of RAM are now commonplace, so I looked at
the related comments in the source code to see whether they still match
current hardware capabilities. The comment says this may become a runtime
parameter in the future. I am a DBA and have thought about implementing
such a parameter myself, but I am not a professional kernel developer, so
I would like to ask the more experienced community developers to implement
it.

/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
* here, but we need them to figure out offsets within MainLWLockArray, and
* having this file include lock.h or bufmgr.h would be backwards.
*/

/* Number of partitions of the shared buffer mapping hashtable */
#define NUM_BUFFER_PARTITIONS 128

/*
* The number of partitions for locking purposes. This is set to match
* NUM_BUFFER_PARTITIONS for now, on the basis that whatever's good enough for
* the buffer pool must be good enough for any other purpose. This could
* become a runtime parameter in future.
*/
#define DSHASH_NUM_PARTITIONS_LOG2 7
#define DSHASH_NUM_PARTITIONS (1 << DSHASH_NUM_PARTITIONS_LOG2)
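
For reference, the mapping from a buffer tag's hash code to one of these
partition locks (a standalone sketch of the logic in buf_internals.h, using
made-up hash values) looks roughly like this:

/* Standalone sketch of the partition mapping; the real function in
 * src/include/storage/buf_internals.h returns an LWLock pointer. */
#include <stdio.h>
#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 128	/* must be a power of 2 */

static inline uint32_t
buf_table_hash_partition(uint32_t hashcode)
{
	return hashcode % NUM_BUFFER_PARTITIONS;
}

int
main(void)
{
	/* Two made-up buffer-tag hash codes: when they land in different
	 * partitions, the lookups contend on different BufMappingLocks. */
	uint32_t	h1 = 0xDEADBEEF;
	uint32_t	h2 = 0x12345678;

	printf("hash %#x -> partition %u\n", (unsigned) h1,
		   (unsigned) buf_table_hash_partition(h1));
	printf("hash %#x -> partition %u\n", (unsigned) h2,
		   (unsigned) buf_table_hash_partition(h2));
	return 0;
}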

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: wenhui qiu (#1)
Re: Thoughts about NUM_BUFFER_PARTITIONS

On 08/02/2024 12:17, wenhui qiu wrote:

Hi hackers,
While reading the documentation at
https://www.interdb.jp/pg/pgsql08/03.html I came across this paragraph:
/*
The BufMappingLock is split into partitions to reduce contention in the
buffer table (the default is 128 partitions). Each BufMappingLock
partition guards a portion of the corresponding hash bucket slots.
*/

Physical servers with terabytes of RAM are now commonplace, so I looked
at the related comments in the source code to see whether they still
match current hardware capabilities.

The optimal number of partitions has more to do with the number of
concurrent processes using the buffer cache than with the size of the
cache. The number of CPUs in servers has increased too, but not as much
as RAM.

But yeah, it's a valid question whether the settings still make sense on
modern hardware.

The comment says this may become a runtime parameter in the future. I am
a DBA and have thought about implementing such a parameter myself, but I
am not a professional kernel developer, so I would like to ask the more
experienced community developers to implement it.

The first thing to do would be to benchmark with different
NUM_BUFFER_PARTITIONS settings, and see if there's benefit in having
more partitions.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi Heikki Linnakangas,
I think the larger the shared buffers, the higher the probability of
multiple backend processes accessing the same bucket slot's BufMappingLock
simultaneously (see InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS)). When
I have some free time I want to run this test. I have seen some tests, but
the resulting report is in Chinese.

Best wishes

#4 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: wenhui qiu (#3)
Re: Thoughts about NUM_BUFFER_PARTITIONS

I think Heikki is right that this is unrelated to the amount of RAM. The
partitions are meant to reduce the number of lock collisions when
multiple processes try to map a buffer concurrently. But the machines
got much larger in this regard too - in 2006 the common CPUs had maybe
2-4 cores, now it's common to have CPUs with ~100 cores, and systems
with multiple of them. OTOH the time spent holding the partition lock
should be pretty low, IIRC we pretty much just pin the buffer before
releasing it, and the backend should do plenty of other expensive stuff.
So who knows how many backends end up doing the locking at the same time.

OTOH, with 128 partitions it takes just 14 backends to have 50% chance
of a conflict, so with enough cores ... But how many partitions would be
enough? With 1024 partitions it still takes only 38 backends to get 50%
chance of a collision. Better, but considering we now have hundreds of
cores, not sure if sufficient.

(Obviously, we probably want much lower probability of a collision, I
only used 50% to illustrate the changes).
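
A quick back-of-the-envelope check of those figures, assuming each backend
hits one uniformly random, independent partition (a birthday-problem
estimate, not a model of any real workload):

/* Probability that at least two of n backends pick the same partition. */
#include <stdio.h>

static double
collision_probability(int backends, int partitions)
{
	double		p_all_distinct = 1.0;

	for (int i = 0; i < backends; i++)
		p_all_distinct *= (double) (partitions - i) / partitions;

	return 1.0 - p_all_distinct;
}

int
main(void)
{
	printf("128 partitions,  14 backends: %.2f\n", collision_probability(14, 128));
	printf("1024 partitions, 38 backends: %.2f\n", collision_probability(38, 1024));
	return 0;
}

Both cases print roughly 0.50, matching the 50% figures above.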

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Li Japin
japinli@hotmail.com
In reply to: Tomas Vondra (#4)
Re: Thoughts about NUM_BUFFER_PARTITIONS

It seems we need to increase MAX_SIMUL_LWLOCKS if we enlarge
NUM_BUFFER_PARTITIONS, but I didn't find any comments describing the
relation between MAX_SIMUL_LWLOCKS and NUM_BUFFER_PARTITIONS. Am I
missing something?

#6 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Li Japin (#5)
Re: Thoughts about NUM_BUFFER_PARTITIONS

On 2/18/24 03:30, Li Japin wrote:

It seems we need to increase MAX_SIMUL_LWLOCKS if we enlarge
NUM_BUFFER_PARTITIONS, but I didn't find any comments describing the
relation between MAX_SIMUL_LWLOCKS and NUM_BUFFER_PARTITIONS. Am I
missing something?

IMHO the relationship is pretty simple - MAX_SIMUL_LWLOCKS needs to be
higher than NUM_BUFFER_PARTITIONS, so that the backend can acquire all
the partition locks if needed.

There's other places that acquire a bunch of locks, and all of them need
to be careful not to exceed MAX_SIMUL_LWLOCKS. For example gist has
GIST_MAX_SPLIT_PAGES.
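
To make the limit concrete, here is a minimal standalone sketch (not the
actual lwlock.c code, which also tracks lock modes) of the fixed-size
per-backend bookkeeping that MAX_SIMUL_LWLOCKS caps:

/* Toy model of the held-locks array; the real array and the
 * "too many LWLocks taken" check live in src/backend/storage/lmgr/lwlock.c. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_BUFFER_PARTITIONS	128
#define MAX_SIMUL_LWLOCKS		200

typedef struct LWLock LWLock;	/* opaque in this sketch */

static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
static int	num_held_lwlocks = 0;

static void
remember_lock(LWLock *lock)
{
	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
	{
		fprintf(stderr, "too many LWLocks taken\n");
		exit(EXIT_FAILURE);
	}
	held_lwlocks[num_held_lwlocks++] = lock;
}

int
main(void)
{
	/* Taking every buffer mapping partition lock at once only fits
	 * because NUM_BUFFER_PARTITIONS (128) < MAX_SIMUL_LWLOCKS (200). */
	for (int i = 0; i < NUM_BUFFER_PARTITIONS; i++)
		remember_lock(NULL);	/* placeholder; real code stores the lock */

	printf("holding %d locks, limit is %d\n", num_held_lwlocks, MAX_SIMUL_LWLOCKS);
	return 0;
}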

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Japin Li
japinli@hotmail.com
In reply to: Tomas Vondra (#6)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Thanks for the explanation! Got it.

#8 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Japin Li (#7)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi Japin Li,
Thank you for the important information! Got it.

#9 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: wenhui qiu (#8)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi Heikki Linnakangas,
Looking through the git log I found this commit:
https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892
but I could not find an email discussing it. As the commit log tells us,
we don't know exactly how large a value is optimal, and I believe it would
be more flexible to make it a parameter. Thank you very much, Tomas Vondra,
for explaining the relationship; I see that MAX_SIMUL_LWLOCKS was simply
doubled in that commit - is there a more appropriate ratio between them?

```````````````````````````````````````````````````````````````````````````
commit 3acc10c997f916f6a741d0b4876126b7b08e3892
Author: Robert Haas <rhaas@postgresql.org>
Date: Thu Oct 2 13:58:50 2014 -0400

Increase the number of buffer mapping partitions to 128.

Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64. It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.

`````````````````````````````````````````````````````````

#10 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: wenhui qiu (#9)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi,

I think the discussion for that commit is in [1] (and especially [2]).

That being said, I don't think MAX_SIMUL_LWLOCKS and NUM_BUFFER_PARTITIONS
need to be in any particular ratio. The only requirement is that there
needs to be enough slack, and the current slack of 72 locks seemed to work
quite fine until now - I don't think we need to change that.

What might be necessary is improving held_lwlocks - we treat it as a LIFO,
but more as an expectation than a hard rule. I'm not sure how often we
violate that rule (if at all), but if we do then it's going to get more
expensive as we increase the number of locks. But I'm not sure this is
actually a problem in practice, we usually hold very few LWLocks at the
same time.

As for making this a parameter, I'm rather opposed to the idea. If we
don't have a very clear idea how to set this limit, what's the chance
users with little knowledge of the internals will pick a good value?
Adding yet another knob would just mean users start messing with it in
random ways (typically increasing it to very high value, because "more
is better"), causing more harm than good.

Adding it as a GUC would also require making some parts dynamic (instead
of just doing static allocation with compile-time constants). That's not
great, but I'm not sure how significant the penalty might be.

IMHO adding a GUC might be acceptable only if we fail to come up with a
good value (which is going to be a trade off), and if someone
demonstrates a clear benefit of increasing the value (which I don't
think has happened in this thread yet).

regards

[1] /messages/by-id/CAA4eK1LSTcMwXNO8ovGh7c0UgCHzGbN=+PjggfzQDukKr3q_DA@mail.gmail.com

[2] /messages/by-id/CA+TgmoY58dQi8Z=FDAu4ggxHV-HYV03-R9on1LSP9OJU_fy_zA@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Tomas Vondra (#10)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi Tomas Vondra,
Thanks for the information! I found that Postgres Pro Enterprise has
already implemented this; however, it defaults to 16 partitions and maxes
out at 128, the same maximum as in PostgreSQL. I would kindly ask the
developers to explain the purpose of this. Maybe 128 partitions is the
optimal value, and the parameter just makes it easier to adjust the number
of partitions in the future when it's really not enough. The code comments
also say they hope to implement such a parameter in the future.

( https://postgrespro.com/docs/enterprise/16/runtime-config-locks )

log2_num_lock_partitions (integer)
<https://postgrespro.com/docs/enterprise/16/runtime-config-locks#GUC-LOG2-NUM-LOCK-PARTITIONS>

This controls how many partitions the shared lock tables are divided into.
Number of partitions is calculated by raising 2 to the power of this
parameter. The default value is 4, which corresponds to 16 partitions, and
the maximum is 8. This parameter can only be set in the postgresql.conf file
or on the server command line.
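
For illustration, in Postgres Pro Enterprise such a setting would go into
postgresql.conf roughly like this (the value is only an example; this GUC
does not exist in community PostgreSQL):

# 2^7 = 128 lock table partitions (their default is 4, i.e. 16; maximum is 8)
log2_num_lock_partitions = 7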

Best wishes

#12 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: wenhui qiu (#11)
Re: Thoughts about NUM_BUFFER_PARTITIONS

Hi,

Well, if Postgres Pro implements this, I don't know what their reasoning
was exactly, but I guess they wanted to make it easier to experiment
with different values (without rebuild), or maybe they actually have
systems where they know higher values help ...

Note: I'd point out that the maximum value 8 translates to 256, so no - it
does not max out at the same value as PostgreSQL.

Anyway, this value is inherently a trade off. If it wasn't, we'd set it
to something super high from the start. But having more partitions of
the lock table has a cost too, because some parts need to acquire all
the partition locks (and that's O(N) where N = number of partitions).

Of course, not having enough lock table partitions has a cost too,
because it increases the chance of conflict between backends (who happen
to need to operate on the same partition). This constant is not
constant, it changes over time - with 16 cores the collisions might have
been rare, with 128 not so much. Also, with partitioning we may need
many more locks per query.

This means it's entirely possible it'd be good to have more than 128
partitions of the lock table, but we only change this kind of stuff if
we have 2 things:

1) clear demonstration of the benefits (e.g. a workload showing an
improvement with higher number of partitions)

2) analysis of how this affects other workloads (e.g. cases that may
need to lock all the partitions etc)

Ultimately it's a trade off - we need to judge if the impact in (2) is
worth the improvement in (1).

None of this was done in this thread. There's no demonstration of the
benefits, no analysis of the impact etc.

As for turning the parameter into a GUC, that has a cost too. Either
direct - a compiler can do far more optimizations with compile-time
constants than with values that may change during execution, for
example. Or indirect - if we can't give users any guidance how/when to
tune the GUC, it can easily lead to misconfiguration (I can't even count
how many times I had to deal with systems where the values were "tuned"
following the logic that more is always better).

Which just leads back to (1) and (2) even for this case.
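
To illustrate the compile-time-constant point, here is a small standalone
sketch (variable names are purely illustrative): with a constant power of
two the modulo folds into a bit mask, while a plain runtime variable
generally forces a real division unless a precomputed mask is maintained
alongside it.

/* Sketch only - not PostgreSQL code. */
#include <stdio.h>
#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 128	/* compile-time constant */

int			num_buffer_partitions = 128;		/* hypothetical GUC-backed value */
int			num_buffer_partitions_mask = 127;	/* kept in sync: partitions - 1 */

static uint32_t
partition_constant(uint32_t hashcode)
{
	/* The compiler can turn this into (hashcode & 127). */
	return hashcode % NUM_BUFFER_PARTITIONS;
}

static uint32_t
partition_runtime(uint32_t hashcode)
{
	/* "% num_buffer_partitions" would need a division at run time;
	 * masking avoids it, at the cost of maintaining the mask. */
	return hashcode & (uint32_t) num_buffer_partitions_mask;
}

int
main(void)
{
	uint32_t	h = 0xDEADBEEF;	/* made-up hash code */

	printf("constant: %u, runtime mask: %u\n",
		   (unsigned) partition_constant(h), (unsigned) partition_runtime(h));
	return 0;
}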

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#12)
4 attachment(s)
Re: Thoughts about NUM_BUFFER_PARTITIONS

One of our customers recently asked me to look into buffer mapping.
Following is my POV on the problem of optimal NUM_BUFFER_PARTITIONS.

I've found some dead code: BufMappingPartitionLockByIndex() is unused, and has been unused for a long time. See patch 1.

On 23 Feb 2024, at 22:25, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Well, if Postgres Pro implements this, I don't know what their reasoning
was exactly, but I guess they wanted to make it easier to experiment
with different values (without rebuild), or maybe they actually have
systems where they know higher values help ...

Note: I'd point the maximum value 8 translates to 256, so no - it does
not max at the same value as PostgreSQL.

I've prototyped a similar GUC for anyone willing to do such experiments. See patches 2 and 4. Probably I'll do some experiments too, on customers' clusters and workloads :)

Anyway, this value is inherently a trade off. If it wasn't, we'd set it
to something super high from the start. But having more partitions of
the lock table has a cost too, because some parts need to acquire all
the partition locks (and that's O(N) where N = number of partitions).

I've found no such cases, actually. Or perhaps I was not looking hard enough.
pg_buffercache iterates over buffers and releases the locks as it goes. See patch 3, which fixes the comments.

Of course, not having enough lock table partitions has a cost too,
because it increases the chance of conflict between backends (who happen
to need to operate on the same partition). This constant is not
constant, it changes over time - with 16 cores the collisions might have
been rare, with 128 not so much. Also, with partitioning we may need
many more locks per query.

This means it's entirely possible it'd be good to have more than 128
partitions of the lock table, but we only change this kind of stuff if
we have 2 things:

1) clear demonstration of the benefits (e.g. a workload showing an
improvement with higher number of partitions)

2) analysis of how this affects other workloads (e.g. cases that may
need to lock all the partitions etc)

Ultimately it's a trade off - we need to judge if the impact in (2) is
worth the improvement in (1).

None of this was done in this thread. There's no demonstration of the
benefits, no analysis of the impact etc.

As for turning the parameter into a GUC, that has a cost too. Either
direct - a compiler can do far more optimizations with compile-time
constants than with values that may change during execution, for
example.

I think the overhead of finding the partition by hash is negligibly small.
The num_partitions setting of an HTAB controls the number of freelists, though; that might have some effect.

Or indirect - if we can't give users any guidance how/when to
tune the GUC, it can easily lead to misconfiguration (I can't even count
how many times I had to deal with systems where the values were "tuned"
following the logic that more is always better).

Yes, IMHO this argument is the most important one. By adding more such knobs we promote a superstitious approach to tuning.

Best regards, Andrey Borodin.

Attachments:

v0-0001-Remove-unused-functions-in-buf_internals.h.patch
From 35b737162a336d3cac5e79e0b4bc691a5947b202 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 4 Aug 2024 20:59:33 +0500
Subject: [PATCH v0 1/4] Remove unused functions in buf_internals.h

---
 src/include/storage/buf_internals.h | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e4..fd311a47a2 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -176,23 +176,11 @@ BufTagMatchesRelFileLocator(const BufferTag *tag,
  * hash code with BufTableHashCode(), then apply BufMappingPartitionLock().
  * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
  */
-static inline uint32
-BufTableHashPartition(uint32 hashcode)
-{
-	return hashcode % NUM_BUFFER_PARTITIONS;
-}
-
 static inline LWLock *
 BufMappingPartitionLock(uint32 hashcode)
 {
 	return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET +
-							BufTableHashPartition(hashcode)].lock;
-}
-
-static inline LWLock *
-BufMappingPartitionLockByIndex(uint32 index)
-{
-	return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + index].lock;
+							hashcode % NUM_BUFFER_PARTITIONS].lock;
 }
 
 /*
-- 
2.39.3 (Apple Git-146)

v0-0002-GUCify-NUM_BUFFER_PARTITIONS.patch
From cf441fafbc3e778cc353057589ecd038881550ee Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 4 Aug 2024 21:27:59 +0500
Subject: [PATCH v0 2/4] GUCify NUM_BUFFER_PARTITIONS

---
 src/backend/storage/buffer/bufmgr.c           |  8 ++++++++
 src/backend/utils/init/globals.c              |  4 ++++
 src/backend/utils/misc/guc_tables.c           | 10 ++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/storage/buf_internals.h           |  2 +-
 src/include/storage/lwlock.h                  |  4 +++-
 src/include/utils/guc_hooks.h                 |  1 +
 src/test/modules/test_misc/t/003_check_guc.pl |  4 +++-
 8 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4415ba648a..566f7e85eb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
+#include "utils/guc_hooks.h"
 #include "utils/memdebug.h"
 #include "utils/ps_status.h"
 #include "utils/rel.h"
@@ -169,6 +170,13 @@ int			bgwriter_lru_maxpages = 100;
 double		bgwriter_lru_multiplier = 2.0;
 bool		track_io_timing = false;
 
+/* GUC assign hook for num_buffer_partitions_log2 */
+void
+assign_num_buffer_partitions_log2(int newval, void *extra)
+{
+	num_buffer_partitions_mask = (1 << newval) - 1;
+}
+
 /*
  * How many buffers PrefetchBuffer callers should try to stay ahead of their
  * ReadBuffer calls by.  Zero means "never prefetch".  This value is only used
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d8debd160e..c6cdf4a89b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -168,3 +168,7 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* shared buffers partitions number and mask */
+int			num_buffer_partitions_log2 = 7;
+int			num_buffer_partitions_mask = 127;
\ No newline at end of file
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 74a38bb245..201a7f7947 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2548,6 +2548,16 @@ struct config_int ConfigureNamesInt[] =
 		check_max_stack_depth, assign_max_stack_depth, NULL
 	},
 
+	{
+		{"num_buffer_partitions_log2", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets number of partitions for shared buffers mapping hashtable."),
+			NULL,
+		},
+		&num_buffer_partitions_log2,
+		7, 5, 16,
+		NULL, assign_num_buffer_partitions_log2, NULL
+	},
+
 	{
 		{"temp_file_limit", PGC_SUSET, RESOURCES_DISK,
 			gettext_noop("Limits the total size of all temporary files used by each process."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9ec9f97e92..27b1371f38 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -143,6 +143,7 @@
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #logical_decoding_work_mem = 64MB	# min 64kB
 #max_stack_depth = 2MB			# min 100kB
+#num_buffer_partitions_log2 = 7	# number of partitions in shared buffer mapping hashtable
 #shared_memory_type = mmap		# the default is the first option
 					# supported by the operating system:
 					#   mmap
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fd311a47a2..14421647c2 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -180,7 +180,7 @@ static inline LWLock *
 BufMappingPartitionLock(uint32 hashcode)
 {
 	return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET +
-							hashcode % NUM_BUFFER_PARTITIONS].lock;
+							(hashcode & num_buffer_partitions_mask)].lock;
 }
 
 /*
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e0..dad06f37b3 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -90,7 +90,9 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  128
+extern int num_buffer_partitions_log2;
+extern int num_buffer_partitions_mask;
+#define NUM_BUFFER_PARTITIONS  (1 << num_buffer_partitions_log2)
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 153c652c93..8fd87c70dc 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -87,6 +87,7 @@ extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
 extern void assign_max_wal_size(int newval, void *extra);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
 extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_num_buffer_partitions_log2(int newval, void *extra);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
 										   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
diff --git a/src/test/modules/test_misc/t/003_check_guc.pl b/src/test/modules/test_misc/t/003_check_guc.pl
index 3ae4bb1cd9..7f6e566af0 100644
--- a/src/test/modules/test_misc/t/003_check_guc.pl
+++ b/src/test/modules/test_misc/t/003_check_guc.pl
@@ -56,8 +56,10 @@ while (my $line = <$contents>)
 	# file.
 	# - Valid configuration options are followed immediately by " = ",
 	# with one space before and after the equal sign.
-	if ($line =~ m/^#?([_[:alpha:]]+) = .*/)
+	print $line;
+	if ($line =~ m/^#?([_[:alnum:]]+) = .*/)
 	{
+		print "da";
 		# Lower-case conversion matters for some of the GUCs.
 		my $param_name = lc($1);
 
-- 
2.39.3 (Apple Git-146)

v0-0003-Remove-reference-to-pg_buffercache-near-MAX_SIMUL.patch
From c681261d449e884bd744ce4d5df9f2f92c7b9b5c Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 4 Aug 2024 21:31:00 +0500
Subject: [PATCH v0 3/4] Remove reference to pg_buffercache near
 MAX_SIMUL_LWLOCKS

---
 src/backend/storage/lmgr/lwlock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..8254c0d479 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -190,8 +190,7 @@ LWLockPadded *MainLWLockArray = NULL;
 /*
  * We use this structure to keep track of locked LWLocks for release
  * during error recovery.  Normally, only a few will be held at once, but
- * occasionally the number can be much higher; for example, the pg_buffercache
- * extension locks all buffer partitions simultaneously.
+ * occasionally the number can be much higher.
  */
 #define MAX_SIMUL_LWLOCKS	200
 
-- 
2.39.3 (Apple Git-146)

v0-0004-Adjust-dshash.c-comment-about-NUM_BUFFER_PARTITIO.patch
From d637312952ac34c5f5783dd033398caeb6257d17 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 4 Aug 2024 21:34:24 +0500
Subject: [PATCH v0 4/4] Adjust dshash.c comment about NUM_BUFFER_PARTITIONS

---
 src/backend/lib/dshash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 93a9e21ddd..e1a34d853f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -51,7 +51,7 @@ struct dshash_table_item
 };
 
 /*
- * The number of partitions for locking purposes.  This is set to match
+ * The number of partitions for locking purposes.  This is set to match default
  * NUM_BUFFER_PARTITIONS for now, on the basis that whatever's good enough for
  * the buffer pool must be good enough for any other purpose.  This could
  * become a runtime parameter in future.
-- 
2.39.3 (Apple Git-146)

#14 Nathan Bossart
nathandbossart@gmail.com
In reply to: Andrey M. Borodin (#13)
Re: Thoughts about NUM_BUFFER_PARTITIONS

On Mon, Aug 05, 2024 at 12:32:20AM +0500, Andrey M. Borodin wrote:

I've found some dead code: BufMappingPartitionLockByIndex() is unused,
and has been unused for a long time. See patch 1.

I don't see a reason to also get rid of BufTableHashPartition(), but
otherwise this looks reasonable to me. It would probably be a good idea to
first check whether there are any external callers we can find.

I've prototyped a similar GUC for anyone willing to do such experiments.
See patches 2 and 4. Probably I'll do some experiments too, on customers'
clusters and workloads :)

Like Tomas, I'm not too wild about making this a GUC. And as Heikki
pointed out upthread, a good first step would be to benchmark different
NUM_BUFFER_PARTITIONS settings on modern hardware. I suspect the current
setting is much lower than optimal (e.g., I doubled it and saw a TPS boost
for read-only pgbench on an i5-13500T), but I don't think we fully
understand the performance characteristics with different settings yet. If
we find that the ideal setting depends heavily on workload/hardware, then
perhaps we can consider adding a GUC, but I don't think we are there yet.
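
For anyone who wants to reproduce that kind of comparison, a rough sketch of
one way to do it (paths, scale factor, and client counts are arbitrary
examples; the only required change is the NUM_BUFFER_PARTITIONS define
before rebuilding):

# edit src/include/storage/lwlock.h: #define NUM_BUFFER_PARTITIONS  256
./configure --prefix=$HOME/pg-test && make -j && make install
initdb -D $HOME/pg-test/data
pg_ctl -D $HOME/pg-test/data -l logfile start
createdb bench
pgbench -i -s 500 bench                             # initialize test tables
pgbench -S -M prepared -c 128 -j 128 -T 120 bench   # read-only run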

Anyway, this value is inherently a trade off. If it wasn't, we'd set it
to something super high from the start. But having more partitions of
the lock table has a cost too, because some parts need to acquire all
the partition locks (and that's O(N) where N = number of partitions).

I've found no such cases, actually. Or perhaps I was not looking hard
enough. pg_buffercache iterates over buffers and releases the locks as it
goes. See patch 3, which fixes the comments.

Yeah, I think 0003 is reasonable, too. pg_buffercache stopped acquiring
all the partition locks in v10 (see commit 6e65454), which is also the
commit that removed all remaining uses of BufMappingPartitionLockByIndex().
In fact, I think BufMappingPartitionLockByIndex() was introduced just for
this pg_buffercache use-case (see commit ea9df81).

--
nathan