scalability bottlenecks with (many) partitions (and more)

Started by Tomas Vondra almost 2 years ago, 61 messages
#1 Tomas Vondra
tomas.vondra@enterprisedb.com
5 attachment(s)

Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.

Consider a partitioned pgbench with 100 partitions, say:

pgbench -i -s 100 --partitions 100 testdb

but let's modify the pgbench_accounts a little bit:

ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
UPDATE pgbench_accounts SET aid_parent = aid;
CREATE INDEX ON pgbench_accounts(aid_parent);
VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And
now let's do a query

SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid

so it's pretty much the regular "pgbench -S", except that the condition is
on a column that does not allow partition elimination. Now, the plan looks
like this:

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=1.52..34.41 rows=10 width=465)
   Hash Cond: (pa.bid = pb.bid)
   ->  Append  (cost=0.29..33.15 rows=10 width=101)
         ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
             pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
             pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
             pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
             pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  ...
   ->  Hash  (cost=1.10..1.10 rows=10 width=364)
         ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10
             width=364)

So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

 parts        1        8       16       32       64       96      160      224
 -------------------------------------------------------------------------------
     0    13877   105732   210890   410452   709509   844683  1050658  1163026
   100      653     3957     7120    12022    12707    11813    10349     9633
  1000       20      142      270      474      757      808      567      427

These are transactions per second, for different number of clients
(numbers in the header). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput grows by ~13k tps
per client initially, and then the growth gradually levels off. But it
keeps growing all the time.

But with 100 or 1000 partitions, it peaks and then starts dropping
again. And moreover, the throughput with 100 or 1000 partitions is just
a tiny fraction of the non-partitioned value. The difference is roughly
equal to the number of partitions - for example with 96 clients, the
difference between 0 and 1000 partitions is 844683/808 = 1045.

I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:

 dims         1        8       16       32       64       96      160      224
 -------------------------------------------------------------------------------
    1     11737    92925   183678   361497   636598   768956   958679  1042799
   10       462     3558     7086    13889    25367    29503    25353    24030
  100         4       31       61      122      231      292      292      288

So, similar story - significant slowdown as we're adding dimensions.

Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may simply be due to the execution
doing more stuff (more index scans, more initialization, ...), but maybe
not - one of the reasons I started looking into this was that even for
small scales the CPU was not fully used - it was maybe 60% utilized.

So I started poking at things. The first thing I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and if we have 100 or
1000 of them, that's potentially a lot of locks.

From past experiments I knew about two places where such a bottleneck
could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to
give it a try, increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see the 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on a
machine with many cores and/or many partitions we can easily hit this.
So I bumped this 4x to 64 partitions.
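
The change itself is essentially just something like this in lwlock.h
(paraphrasing the patch, not quoting it):

/* src/include/storage/lwlock.h */
#define LOG2_NUM_LOCK_PARTITIONS  6        /* was 4, i.e. 16 partitions */
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)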

For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when those slots
fill up do we promote the locks to the actual lock table. But with enough
partitions we're guaranteed to fill these 16 slots, of course. Increasing
the number of slots is not simple, though - firstly, the information is
split between an array of 16 OIDs and a uint64 serving as a bitmap.
Increasing the size of the OID array is simple, but it's harder for the
auxiliary bitmap. And there are more problems - with more OIDs a simple
linear search won't do. But a simple hash table is not a good idea either,
because of poor locality and the need to delete stuff ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
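
To make the layout a bit more concrete, here's a simplified, self-contained
sketch of the idea (this is not the actual 0002 code - the 7883/4481
constants are from the WIP patch, the rest is just for illustration, and
the real thing of course lives in PGPROC and handles lock modes properly):

#include <stdint.h>
#include <stdbool.h>

#define FP_GROUPS           64      /* number of 16-slot groups */
#define FP_SLOTS_PER_GROUP  16      /* same as the current fast-path array */

typedef uint32_t Oid;

typedef struct FastPathGroup
{
    Oid      relid[FP_SLOTS_PER_GROUP]; /* 0 (InvalidOid) = free slot */
    uint64_t bitmap;                    /* lock-mode bits, as today */
} FastPathGroup;

static FastPathGroup fpGroups[FP_GROUPS];

/* trivial hash mapping a relation OID to one group */
#define FP_REL_GROUP(rel) (((uint64_t) (rel) * 7883 + 4481) % FP_GROUPS)

static bool
fp_lock_insert(Oid relid, uint64_t modebits)
{
    FastPathGroup *g = &fpGroups[FP_REL_GROUP(relid)];

    /* same linear search we do now, just limited to one 16-slot group */
    for (int i = 0; i < FP_SLOTS_PER_GROUP; i++)
    {
        if (g->relid[i] == 0)
        {
            g->relid[i] = relid;
            g->bitmap |= modebits << (i * 3);   /* 3 bits per slot, as now */
            return true;
        }
    }
    return false;       /* group full - fall back to the shared lock table */
}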

Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:

              1       8      16      32      64      96     160     224
 ------------------------------------------------------------------------
 master   21610  137450  247541  300902  270932  229692  191454  189233
 patched  21664  151695  301451  594615 1036424 1211716 1480953 1656203
 speedup    1.0     1.1     1.2     2.0     3.8     5.3     7.7     8.8

That's a pretty nice speedup, I think.

However, why doesn't the partitioned join improve (or not by very much)?
Well, the perf profile says stuff like this:

9.16% 0.77% postgres [kernel.kallsyms] [k] asm_exc_page_fault
|
--8.39%--asm_exc_page_fault
|
--7.52%--exc_page_fault
|
--7.13%--do_user_addr_fault
|
--6.64%--handle_mm_fault
|
--6.29%--__handle_mm_fault
|
|--2.17%--__mem_cgroup_charge
| |
| |--1.25%--charge_memcg
| | |
| | --0.57%-- ...
| |
| --0.67%-- ...
|
|--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be
some sort of scalability problem in glibc/malloc. I decided to try whether
the "memory pool" patch (which I've mentioned in the memory limit thread
as an alternative way to introduce backend-level accounting/limits) could
serve as a backend-level malloc cache, and how that would work. So I
cleaned up the PoC patch I already had (see 0003), and gave it a try.

And with both patches applied, the results for the partitioned join with
100 partitions look like this:

-M simple

                   1      8     16     32     64     96    160    224
 ------------------------------------------------------------------------
 master          653   3957   7120  12022  12707  11813  10349   9633
 both patches    954   7356  14580  28259  51552  65278  70607  69598
 speedup         1.5    1.9    2.0    2.4    4.1    5.5    6.8    7.2

-M prepared

                    1      8      16      32      64      96     160     224
 ----------------------------------------------------------------------------
 master          1639   8273   14138   14746   13446   14001   11129   10136
 both patches    4792  30102   62208  122157  220984  267763  315632  323567
 speedup          2.9    3.6     4.4     8.3    16.4    19.1    28.4    31.9

That's pretty nice, I think. And I've seen many such improvements, it's
not a cherry-picked example. For the star join, the improvements are
very similar.

I'm attaching PDF files with tables visualizing results for these two
benchmarks - there are results for different numbers of partitions/scales,
and different builds (master, with one or both of the patches). There's
also a comparison to master, with a color scale "red = slower, green =
faster" (but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc) are
in balance - if you only address one, you don't get much. But if you
address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching the malloc-ed memory can clearly be very
beneficial. The basic idea is to have a cache that is "adaptive" (i.e.
adjusts to caching blocks of the sizes needed by the workload) but also
cheap. The patch is PoC/WIP and needs more work, but I think it works
quite well. If anyone wants to take a look or have a chat at FOSDEM, for
example, I'm available.
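
Just to illustrate the general idea, here's a toy sketch (this is NOT the
0003 patch - the real thing does adaptive rebalancing of how much is kept
per size class, this is only the "keep freed blocks around for reuse" part,
with made-up constants):

#include <stdlib.h>

#define POOL_CLASSES    8       /* 1kB, 2kB, 4kB, ... 128kB */
#define POOL_DEPTH      16      /* blocks cached per size class */

typedef struct PoolEntry { struct PoolEntry *next; } PoolEntry;

static PoolEntry *pool[POOL_CLASSES];
static int        pool_count[POOL_CLASSES];

static int
pool_class(size_t size)
{
    int c = 0;

    while (c < POOL_CLASSES && ((size_t) 1024 << c) < size)
        c++;
    return c;                   /* POOL_CLASSES means "too big to cache" */
}

static void *
pool_alloc(size_t size)
{
    int c = pool_class(size);

    if (c < POOL_CLASSES && pool[c] != NULL)
    {
        PoolEntry *e = pool[c];

        pool[c] = e->next;
        pool_count[c]--;
        return e;               /* reuse a cached block of this class */
    }
    return malloc(c < POOL_CLASSES ? ((size_t) 1024 << c) : size);
}

static void
pool_free(void *ptr, size_t size)
{
    int c = pool_class(size);

    if (c < POOL_CLASSES && pool_count[c] < POOL_DEPTH)
    {
        PoolEntry *e = (PoolEntry *) ptr;

        e->next = pool[c];
        pool[c] = e;
        pool_count[c]++;        /* keep it for the next allocation */
        return;
    }
    free(ptr);
}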

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:

54.18% 0.48% postgres [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
|
--53.70%--entry_SYSCALL_64_after_hwframe
|
--53.03%--do_syscall_64
|
|--28.29%--__x64_sys_openat
| |
| --28.14%--do_sys_openat2
| |
| |--23.14%--do_filp_open
| | |
| | --22.72%--path_openat

That's pretty bad - it means we're closing/opening file descriptors like
crazy, because every query needs the files. If I increase the number of
file descriptors (both in ulimit and max_files_per_process) to prevent
this thrashing, I can increase the throughput ~5x. Of course, this is not
a bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it
easier to monitor this, e.g. by tracking the fd cache hit ratio, or
something like that ...

There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at

https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5s (yes, it
helps even on rpi5, quite a bit).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

join - epyc _ builds.pdf
star - epyc _ builds.pdf
v240118-0001-Increase-NUM_LOCK_PARTITIONS-to-64.patch
v240118-0002-Increase-the-number-of-fastpath-locks.patch
v240118-0003-Add-a-memory-pool-with-adaptive-rebalancing.patch
#2 Ronan Dunklau
ronan.dunklau@aiven.io
In reply to: Tomas Vondra (#1)
Re: scalability bottlenecks with (many) partitions (and more)

On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas !

I'll comment on the glibc-malloc part, as I studied it last year and
proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop

FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt

it's the only such allocation.

Did you try running strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.
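
For completeness, the same thresholds can also be pinned from code via
mallopt(3) - a minimal sketch, with arbitrary example values:

#include <malloc.h>

/* pin the thresholds instead of letting glibc adjust them dynamically;
 * the values below are arbitrary examples, not a recommendation */
static void
tune_glibc_malloc(void)
{
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);       /* larger requests use mmap() */
    mallopt(M_TRIM_THRESHOLD, 32 * 1024 * 1024);  /* trim only above 32MB free */
}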

Best regards,

--
Ronan Dunklau

#3 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Ronan Dunklau (#2)
Re: scalability bottlenecks with (many) partitions (and more)

On 1/29/24 09:53, Ronan Dunklau wrote:

On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas !

I'll comment on glibc-malloc part as I studied that part last year, and
proposed some things here: https://www.postgresql.org/message-id/
3424675.QJadu78ljV%40aivenlaptop

Thanks for reminding me. I'll re-read that thread.

FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt

it's the only such allocation.

Did you try running strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

Thanks, I'll take a look at that.

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.

So, what values would you recommend for these parameters?

My concern is that increasing those values would lead to (much) higher
memory usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Ronan Dunklau
ronan.dunklau@aiven.io
In reply to: Tomas Vondra (#3)
Re: scalability bottlenecks with (many) partitions (and more)

On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:

Did you try running strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.

An important part of glibc's malloc behaviour in that regard comes from
the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

Thanks, I'll take a look at that.

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.

So, what values would you recommend for these parameters?

My concern is increasing those value would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.

Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently, but tuning it ourselves could allow us
to release it back.

Since it was last year I worked on this, I'm a bit fuzzy on the details but I
hope this helps.

#5 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Ronan Dunklau (#4)
Re: scalability bottlenecks with (many) partitions (and more)

On 1/29/24 15:15, Ronan Dunklau wrote:

On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:

Did you try running strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.

An important part of glibc's malloc behaviour in that regard comes from
the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

Thanks, I'll take a look at that.

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.

So, what values would you recommend for these parameters?

My concern is increasing those value would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.

Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.

OK, so let's say I expect each backend to use ~90MB of memory (allocated
at once through memory contexts). How would you set the two limits? By
default it's set to 128kB, which means blocks larger than 128kB are
mmap-ed and released immediately.

But there are very few such allocations - the vast majority of blocks in the
benchmark workloads are <= 8kB or ~27kB (those from btbeginscan).

So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the
M_TRIM_THRESHOLD value to a couple MBs. But I doubt that'll really help,
because what I expect to happen is we execute a query and it allocates
all memory up to a high watermark of ~90MB. And then the query
completes, and we release almost all of it. And even with the trim
threshold set to e.g. 8MB we'll free almost all of it, no?

What we want to do is say - hey, we needed 90MB, and now we need 8MB. We
could free 82MB, but maybe let's wait a bit and see if we need that
memory again. And that's pretty much what the mempool does, but I don't
see how to do that using the mmap options.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently, but tuning it ourselves could allow us
to release it back.

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.

But then there's the problem that the mmap parameters don't tell us how
much memory to keep, but how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to trigger releasing
all the memory, or something.

Since it was last year I worked on this, I'm a bit fuzzy on the details but I
hope this helps.

Thanks for the feedback / insights!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Ronan Dunklau
ronan.dunklau@aiven.io
In reply to: Tomas Vondra (#5)
Re: scalability bottlenecks with (many) partitions (and more)

On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a
backend typically does during the execution.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.

But then there's the problem that the mmap parameters don't tell us how
much memory to keep, but how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to trigger releasing
all the memory, or something.

For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a
certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.

So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc's consider it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.

It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to
accomodate your released memory, since anything bigger than half trim
threshold will be allocated using mmap.

Setting any parameter disable that.
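
As a sketch, keeping a fixed amount of free heap around would be a single
mallopt() call (64MB is just an example value):

mallopt(M_TOP_PAD, 64 * 1024 * 1024);   /* keep ~64MB of free heap instead
                                         * of returning it to the kernel */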

But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)

#7 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Ronan Dunklau (#6)
1 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On 1/29/24 16:42, Ronan Dunklau wrote:

On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point, I was basing my previous observations on what a
backend typically does during the execution.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.

But then there's the problem that the mmap parameters don't tell us how
much memory to keep, but how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to trigger releasing
all the memory, or something.

For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a
certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.

So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc's consider it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.

It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to
accomodate your released memory, since anything bigger than half trim
threshold will be allocated using mmap.

Setting any parameter disable that.

Thanks. I gave this a try, and I started the tests with this setting:

export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))

which I believe means that:

1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for
future allocations

2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on heap (hence from the cache)

3) we won't trim heap unless there's at least 1MB of free contiguous
space (I wonder if this should be the same as MALLOC_TOP_PAD)

Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.

first column - "master" (17dev) with no patches, default glibc

second column - 17dev + locking + mempool, default glibc

third column - 17dev + locking, tuned glibc

The color scale on the right is throughput comparison (third/second), as
a percentage with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%,
sometimes it's a bit faster. So overall it's roughly the same.

The color scales below the results are a comparison of each branch to
master (without patches), i.e. a comparison to current performance.
It's almost the same, although the tuned glibc has a couple of regressions
that the mempool does not have.

But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)

Yeah. I think the main problem with the glibc parameters is that they're
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

glibc-malloc-tuning.pdf
#8 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#1)
Re: scalability bottlenecks with (many) partitions (and more)

On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Robert Haas (#8)
Re: scalability bottlenecks with (many) partitions (and more)

On 6/24/24 17:05, Robert Haas wrote:

On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

OK. I didn't realize global variables start at zero.

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)

Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
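
If we ever want something slightly better but still cheap, one option might
be the existing murmurhash32() from common/hashfn.h - just a thought, I
haven't benchmarked it:

#include "common/hashfn.h"

/* possible alternative to the multiply-and-add constants (not benchmarked) */
#define FAST_PATH_LOCK_REL_GROUP(rel) \
    (murmurhash32((uint32) (rel)) % FP_LOCK_GROUPS_PER_BACKEND)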

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way associative cache, assigning each entry into one of 16 slots.

I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 slots may not be
enough even with a modest number of partitions. When I was thinking about
using a higher value, my main concern was that it'd make the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#9)
Re: scalability bottlenecks with (many) partitions (and more)

On Tue, Jun 25, 2024 at 6:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.

Right. My guess is that if we try too hard to make the hash function
good, there will be a performance hit. Unlike, say, strings that come
from the user, we have no reason to believe that relfilenumbers will
have any particular structure or pattern to them, so a low-quality,
fast function seems like a good trade-off to me. But I'm *far* from a
hashing expert, so I'm prepared for someone who is to tell me that I'm
full of garbage.

I don't think I've heard the term "partially-associative cache" before
That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd made the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.

Sounds good.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#9)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 6/25/24 12:04, Tomas Vondra wrote:

On 6/24/24 17:05, Robert Haas wrote:

On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

OK. I didn't realize global variables start at zero.

I haven't fixed this yet, but it's pretty clear the "init" is not really
needed, because it did the memset() wrong:

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

This only resets one byte of the array, yet it still worked correctly.
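
(For the record, zeroing the whole array would have required the sizeof of
the array itself, i.e.

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));

but as explained above, the reset is not really needed at all.)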

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)

Yeah, definitely needs comment explaining this.

I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way associative cache, assigning each entry into one of 16 slots.

I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd made the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.

But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.

I finally got to do those experiments. The scripts and results (both raw
and summarized) are too big to attach everything here, available at

https://github.com/tvondra/scalability-tests

The initial patch used 64 (which means 1024 fast-path slots), I ran the
tests with 0, 1, 8, 32, 128, 512, 1024 (so up to 16k locks). I thought
about testing with ~64k groups, but I didn't go with the extreme value
because I don't quite see the point.

It would only matter for cases with a truly extreme number of partitions
(64k groups is ~1M fast-path slots), and just creating enough partitions
would take a lot of time. Moreover, with that many partitions we seem
to have various other bottlenecks, and improving this one does not make it
really practical. And it's so slow the benchmark results are somewhat
bogus too.

Because if we achieve 50 tps with 1000 partitions, does it really matter
whether a patch changes that to 25 or 100 tps? I doubt that, especially if
going to 100 partitions gives you 2000 tps. Now imagine you have 10k or
100k partitions - how fast is that going to be?

So I think stopping at 1024 groups is sensible, and if there are some
inefficiencies / costs, I'd expect those to gradually show up even at
those lower sizes.

But if you look at results, for example from the "join" test:

https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different
combinations of parameters, with the first group of columns being on
regular glibc, while the second one has the glibc tuning (see [1] for
details). And the values are for different numbers of fast-path groups
(0 means the patch was not applied).

And the color scale shows the impact of increasing the number of groups.
So for example when a column for "32 groups" says 150%, it means going
from 8 to 32 groups improved throughput to 1.5x. As usual, green is
"good" and red is "bad".

But if you look at the tables, there's very little change - most of the
values are close to 100%. This might seem a bit strange, considering the
promise of these patches is to improve throughput, and "no change" is an
absence of that. But that's because the charts illustrate the effect of
changing the group count with the other parameters fixed. It never compares
runs with/without the glibc tuning, and that's an important part of the
improvement. Doing the pivot table a bit differently would still show a
substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not
the right hw for sensitive benchmarks), but also on some i5/xeon runs. I
attribute this to only doing one short run (10s) for each combination
of parameters. I'll do more runs next time.

Anyway, I think these results show a couple things:

1) There's no systemic negative effect of increasing the number of
groups. We could go with 32k or 64k groups, and it doesn't seem like
there would be a problem.

2) But there's not much point in doing that, because we run into various
other bottlenecks well before having that many locks. By the results, it
doesn't seem going beyond 32 or 64 groups would give us much.

3) The memory allocation caching (be it the mempool patch, or the glibc
tuning like in this test round) is a crucial piece for this. Not doing
that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or
perhaps even no effect at all.

Based on this, my plan is to polish the patch adding fast-path groups,
with either 32 or 64 groups, which seem to be reasonable values. Then
in the future, if/when the other bottlenecks get addressed, we can
rethink and increase this.

This however reminds me that all those machines are pretty small. Which
is good for showing it doesn't break existing/smaller systems, but the
initial goal of the patch was to improve behavior on big boxes. I don't
have access to the original box at the moment, so if someone could
provide an access to one of those big epyc/xeon machines with 100+ cores
for a couple days, that would be helpful.

That being said, I think it's pretty clear how serious the issue with
memory allocation overhead can be, especially in cases when the existing
memory context caching is ineffective (like for the btree palloc). I'm
not sure what to do about that. The mempool patch shared in this thread
does the trick, but it's fairly complex/invasive. I still like it, but
maybe doing something with the glibc tuning would be enough - it's not as
effective, but 80% of the improvement is better than no improvement.

regards

[1] https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com

--
Tomas Vondra

#12 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#11)
2 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

While discussing this patch with Robert off-list, one of the questions
he asked was whether there's some size threshold after which it starts to
have a negative impact. I didn't have a good answer to that - I did have
some intuition (that making it too large would not hurt), but I hadn't
done any tests with "extreme" sizes of the fast-path structs.

So I ran some more tests, with up to 4096 "groups" (which means 64k
fast-path slots). And no matter how I slice the results, there's no
clear regression point beyond which the performance would start to
decline (even just slowly). It's the same for all benchmarks, client
counts, query modes, and so on.

I'm attaching two PDFs with results for the "join" benchmark I described
earlier (query with a join on many partitions) from EPYC 7763 (64/128c).
The first one has the "raw" data (throughput = tps), the second one shows
throughput relative to the first column (which is pretty much current
master, with no optimizations applied).

The complete results including some nice .odp spreadsheets and scripts
are available here:

https://github.com/tvondra/pg-lock-scalability-results

There's often a very clear point where the performance significantly
improves - this is usually when all the relation locks start to fit into
the fast-path array. With 1000 relations that's ~64 groups, and so on.
But there's no point where it would start declining.

My explanation is that the PGPROC (where the fast-path array is) is so
large already (close to 1kB) that making it larger does not really cause
any additional cache misses, etc. And if it does, that's far outweighed
by the cost of accessing (or rather not having to access) the shared lock
table.

So I don't think there's any point at which we'd start to regress,
at least not because of cache misses, CPU etc. It stops improving, but
that's just a sign that we've hit some other bottleneck - that's not a
fault of this patch.

But that's not the whole story, of course. Because if there were no
issues, why not just make the fast-path array insanely large? In
another off-list discussion Andres asked me about the memory this would
need, and after looking at the numbers I think that's a strong argument
to keep the numbers reasonable.

I did a quick experiment to see the per-connection memory requirements,
and how they would be affected by this patch. I simply logged the amount
of shared memory reported by CalculateShmemSize(), started the server with
100 and 1000 connections, and did a bit of math to calculate how much
memory we need "per connection".

For master and different numbers of fast-path groups I got this:

master 64 1024 32765
---------------------------------
47668 52201 121324 2406892

So by default we need ~48kB / connection, with 64 groups we need ~52kB
(which makes sense, because that's 1024 slots x 4B), and then with 1024
groups we get to ~120kB, and with 32k groups ~2.5MB.
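
The "bit of math" above is just dividing the size difference by the 900
extra connections; a trivial sketch of that arithmetic, with made-up
totals rather than the actual measured values:

#include <stdio.h>

int main(void)
{
    /* hypothetical totals logged from CalculateShmemSize() */
    long shmem_100  = 150000000;    /* max_connections = 100  */
    long shmem_1000 = 193000000;    /* max_connections = 1000 */

    /* the extra 900 connections account for the difference */
    printf("per-connection: ~%ld bytes\n",
           (shmem_1000 - shmem_100) / (1000 - 100));
    return 0;
}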

I guess those higher values seem a bit insane - we don't want to just
increase the per-connection memory requirements 50x for everyone, right?

But what about the people who actually want this many locks? Let's bump
max_locks_per_transaction from 64 to 1024, and we get this:

master 64 1024 32765
-------------------------------------
419367 423909 493022 2778590

Suddenly, the differences are much smaller, especially for 64 groups,
which is roughly the same number of fast-path slots as the max locks per
transaction. That shrunk to a ~1% difference. And even for 1024 groups
it's now just ~20%, which I think is well worth the benefits.

And that's likely something the system should have available - with 1000
connections that's ~80MB. And if you run with 1000 connections, 80MB
should be a rounding error, IMO.

Of course, it does not seem great to force everyone to pay this price,
even if they don't need that many locks (and so there's no benefit). So
how would we improve that?

I don't think that's possible with a hard-coded array size - that
allocates the memory for everyone. We'd need to make it variable-length,
and while doing those benchmarks I realized we actually already have a
GUC for that - max_locks_per_transaction tells us exactly what we need to
know, right? I mean, if I know I'll need ~1000 locks, why not make the
fast-path array large enough for that?

Of course, the consequence of this would be making PGPROC variable
length, or having to point to memory allocated separately (I prefer
the latter option, I think). I haven't done any experiments, but it
seems fairly doable - of course, not sure if it might be more expensive
compared to compile-time constants.
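
To make that latter option a bit more concrete, here's a minimal
standalone sketch (not the actual patch - the names loosely follow the
existing fast-path fields, everything else is made up) of per-backend
arrays carved out of one separate allocation and merely pointed to from
PGPROC:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Oid;

#define FP_SLOTS_PER_GROUP 16   /* 3 lock-mode bits per slot fit in a uint64 word */

/* stand-in for the fast-path part of PGPROC: just pointers, no arrays */
typedef struct FastPathProc
{
    uint64_t   *fpLockBits;     /* one bitmap word per group */
    Oid        *fpRelId;        /* one relation OID per slot */
} FastPathProc;

/*
 * Carve per-backend arrays out of one big block (shared memory in the
 * real thing, calloc here, error handling omitted); the group count
 * would be derived from max_locks_per_transaction elsewhere.
 */
static void
attach_fast_path_arrays(FastPathProc *procs, int nprocs, int groups)
{
    uint64_t   *bits = calloc((size_t) nprocs * groups, sizeof(uint64_t));
    Oid        *relid = calloc((size_t) nprocs * groups * FP_SLOTS_PER_GROUP,
                               sizeof(Oid));

    for (int i = 0; i < nprocs; i++)
    {
        procs[i].fpLockBits = bits + (size_t) i * groups;
        procs[i].fpRelId = relid + (size_t) i * groups * FP_SLOTS_PER_GROUP;
    }
}

int main(void)
{
    FastPathProc procs[4];

    attach_fast_path_arrays(procs, 4, 64);  /* e.g. 64 groups = 1024 slots */
    return 0;
}

The point being that PGPROC itself stays fixed-size, and only the
separately allocated block scales with the GUC.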

At this point I think it's fairly clear we have significant bottlenecks
when having to lock many relations - and that won't go away, thanks to
partitioning etc. We're already fixing various other bottlenecks for
these workloads, which will just increase pressure on locking.

Fundamentally, I think we'll need to either evolve the fast-path system
to handle more relations (the limit of 16 was always rather low), or
invent some entirely new thing that does something radical (say, locking
a "group" of relations instead of locking them one by one).

This patch is doing the first thing, and IMHO the increased memory
consumption is a sensible / acceptable trade-off. I'm not aware of any
proposal for the second approach, and I don't have any concrete idea how
it might work.

regards

--
Tomas Vondra

Attachments:

join-epyc-relative.pdfapplication/pdf; name=join-epyc-relative.pdf
join-epyc-data.pdfapplication/pdf; name=join-epyc-data.pdf
#13Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#12)
Re: scalability bottlenecks with (many) partitions (and more)

On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra <tomas@vondra.me> wrote:

I don't think that's possible with hard-coded size of the array - that
allocates the memory for everyone. We'd need to make it variable-length,
and while doing those benchmarks I think we actually already have a GUC
for that - max_locks_per_transaction tells us exactly what we need to
know, right? I mean, if I know I'll need ~1000 locks, why not to make
the fast-path array large enough for that?

I really like this idea. I'm not sure about exactly how many fast path
slots you should get for what value of max_locks_per_transaction, but
coupling the two things together in some way sounds smart.

Of course, the consequence of this would be making PGPROC variable
length, or having to point to a memory allocated separately (I prefer
the latter option, I think). I haven't done any experiments, but it
seems fairly doable - of course, not sure if it might be more expensive
compared to compile-time constants.

I agree that this is a potential problem but it sounds like the idea
works well enough that we'd probably still come out quite far ahead
even with a bit more overhead.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Robert Haas (#13)
4 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/2/24 01:53, Robert Haas wrote:

On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra <tomas@vondra.me> wrote:

I don't think that's possible with hard-coded size of the array - that
allocates the memory for everyone. We'd need to make it variable-length,
and while doing those benchmarks I think we actually already have a GUC
for that - max_locks_per_transaction tells us exactly what we need to
know, right? I mean, if I know I'll need ~1000 locks, why not to make
the fast-path array large enough for that?

I really like this idea. I'm not sure about exactly how many fast path
slots you should get for what value of max_locks_per_transaction, but
coupling the two things together in some way sounds smart.

I think we should keep that simple and make the cache large enough for
max_locks_per_transaction locks. That's the best information about the
expected number of locks we have. If the GUC is left at the default
value, that probably means the backends need that many locks on
average. Yes, maybe there's an occasional spike in one of the backends,
but then that means other backends need fewer locks, and so there's less
contention for the shared lock table.

Of course, it's possible to construct counter-examples to this. Say a
single backend that needs a lot of these locks. But how's that different
from every other fixed-size cache with eviction?

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table, but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often a lock fits into the fast-path array, and how often we have to
promote it to the shared lock table, no?
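
The counters themselves would be trivial; a minimal sketch with made-up
names (not an existing API), just to show what I mean:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* hypothetical per-backend counters, exposed via the stats system */
typedef struct FastPathLockStats
{
    uint64_t    fastpath_acquired;      /* lock fit into the fast-path array */
    uint64_t    fastpath_overflowed;    /* had to go to the shared lock table */
} FastPathLockStats;

static FastPathLockStats fp_stats;

/* would be called from the lock acquisition path */
static void
count_fastpath_attempt(bool used_fastpath)
{
    if (used_fastpath)
        fp_stats.fastpath_acquired++;
    else
        fp_stats.fastpath_overflowed++;
}

/* the "hit ratio" discussed above */
static double
fastpath_hit_ratio(void)
{
    uint64_t    total = fp_stats.fastpath_acquired + fp_stats.fastpath_overflowed;

    return (total == 0) ? 0.0 : (double) fp_stats.fastpath_acquired / total;
}

int main(void)
{
    /* pretend 900 locks fit and 100 overflowed */
    for (int i = 0; i < 900; i++)
        count_fastpath_attempt(true);
    for (int i = 0; i < 100; i++)
        count_fastpath_attempt(false);

    printf("fast-path hit ratio: %.2f\n", fastpath_hit_ratio());
    return 0;
}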

Of course, the consequence of this would be making PGPROC variable
length, or having to point to a memory allocated separately (I prefer
the latter option, I think). I haven't done any experiments, but it
seems fairly doable - of course, not sure if it might be more expensive
compared to compile-time constants.

I agree that this is a potential problem but it sounds like the idea
works well enough that we'd probably still come out quite far ahead
even with a bit more overhead.

OK, I did some quick tests on this, and I don't see any regressions.

Attached are 4 patches:

1) 0001 - original patch, with some minor fixes (remove init, which is
not necessary, that sort of thing)

2) 0002 - a bit of rework, improving comments, structuring the macros a
little bit better, etc. But still compile-time constants.

3) 0003 - dynamic sizing, based on max_locks_per_transaction. It's a bit
ugly, because the size is calculated during shmem allocation - it
should happen earlier, but it's good enough for a PoC.

4) 0004 - introduce a separate GUC, this is mostly to allow testing of
different values without changing max_locks_per_transaction

I've only done that on my smaller 32-core machine, but for three simple
tests it looks like this (throughput using 16 clients):

mode test master 1 2 3 4
----------------------------------------------------------------
prepared count 1460 1477 1488 1490 1491
join 15556 24451 26044 25026 24237
pgbench 148187 151192 151688 150389 152681
----------------------------------------------------------------
simple count 1341 1351 1373 1374 1370
join 4643 5439 5459 5393 5345
pgbench 139763 141267 142796 141207 142600

Those are some simple benchmarks on 100 partitions, where the regular
pgbench and count(*) are not expected to improve, and the join is the
partitioned join this thread started with. Columns 1-4 are the attached
patches, to see the impact of each of them.

Translated to results relative to master:

mode test 1 2 3 4
-------------------------------------------------
prepared count 101% 102% 102% 102%
join 157% 167% 161% 156%
pgbench 102% 102% 101% 103%
-------------------------------------------------
simple count 101% 102% 102% 102%
join 117% 118% 116% 115%
pgbench 101% 102% 101% 102%

So pretty much no difference between the patches. A bit of noise, but
that's what I'd expect on this machine.

I'll do more testing on the big EPYC machine once it becomes available,
but from these results it seems pretty promising.

regards

--
Tomas Vondra

Attachments:

v20240902-0001-v1.patchtext/x-patch; charset=UTF-8; name=v20240902-0001-v1.patch
v20240902-0002-rework.patchtext/x-patch; charset=UTF-8; name=v20240902-0002-rework.patch
v20240902-0003-drive-this-by-max_locks_per_transaction.patchtext/x-patch; charset=UTF-8; name=v20240902-0003-drive-this-by-max_locks_per_transaction.patch
v20240902-0004-separate-guc-to-allow-benchmarking.patchtext/x-patch; charset=UTF-8; name=v20240902-0004-separate-guc-to-allow-benchmarking.patch
#15Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#14)
Re: scalability bottlenecks with (many) partitions (and more)

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table, but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

Doing some worst case math, suppose somebody has max_connections=1000
(which is near the upper limit of what I'd consider a sane setting)
and max_locks_per_transaction=10000 (ditto). The product is 10
million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing
up 15GB of RAM when you could have chewed up only 0.5GB certainly
isn't too great. On the other hand, those values are kind of pushing
the limits of what is actually sane. If you imagine
max_locks_per_transaction=2000 rather than
max_locks_per_transaction=10000, then it's only 3GB and that's
hopefully not a lot on the hopefully-giant machine where you're
running this.

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often it fits into the fast-path array, and how often we had to
promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be,
but it will tell you whether it's big enough.

I wonder if we should be looking at further improvements in the lock
manager of some kind. For instance, imagine if we allocated storage
via DSM or DSA for cases where we need a really large number of Lock
entries. The downside of that is that we might run out of memory for
locks at runtime, which would perhaps suck, but you'd probably use
significantly less memory on average. Or, maybe we need an even bigger
rethink where we reconsider the idea that we take a separate lock for
every single partition instead of having some kind of hierarchy-aware
lock manager. I don't know. But this feels like very old, crufty tech.
There's probably something more state of the art that we could or
should be doing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Robert Haas (#15)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/3/24 17:06, Robert Haas wrote:

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table, but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

Agreed.

Doing some worst case math, suppose somebody has max_connections=1000
(which is near the upper limit of what I'd consider a sane setting)
and max_locks_per_transaction=10000 (ditto). The product is 10
million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing
up 15GB of RAM when you could have chewed up only 0.5GB certainly
isn't too great. On the other hand, those values are kind of pushing
the limits of what is actually sane. If you imagine
max_locks_per_transaction=2000 rather than
max_locks_per_transaction=10000, then it's only 3GB and that's
hopefully not a lot on the hopefully-giant machine where you're
running this.

Yeah, although I don't quite follow the math. With 1000/10000 settings,
why would that eat 15GB of RAM? I mean, that's 1.5GB, right?

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct). At least based
on a quick experiment. (Seems a bit high, right?).

Anyway, I agree this might be acceptable. If your transactions use this
many locks regularly, you probably need this setting anyway. If you only
need this many locks occasionally (so that you can keep the locks/xact
value low), it probably does not matter that much.

And if you're running a massively-partitioned table on a tiny box, well,
I don't really think that's a particularly sane idea.

So I think I'm OK with just tying this to max_locks_per_transaction.

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often it fits into the fast-path array, and how often we had to
promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be,
but it will tell you whether it's big enough.

True, but that applies to all "cache hit ratio" metrics (like for our
shared buffers). It'd be great to have something better, enough to tell
you how large the cache needs to be. But we don't :-(

I wonder if we should be looking at further improvements in the lock
manager of some kind. For instance, imagine if we allocated storage
via DSM or DSA for cases where we need a really large number of Lock
entries. The downside of that is that we might run out of memory for
locks at runtime, which would perhaps suck, but you'd probably use
significantly less memory on average. Or, maybe we need an even bigger
rethink where we reconsider the idea that we take a separate lock for
every single partition instead of having some kind of hierarchy-aware
lock manager. I don't know. But this feels like very old, crufty tech.
There's probably something more state of the art that we could or
should be doing.

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful for the complete workload).

For example, if we did what you propose, that might help when very few
transactions need a lot of locks. I don't mind saving memory in that
case, ofc. but is it a problem if those rare cases are a bit slower?
Shouldn't we focus more on cases where many locks are common? Because
people are simply going to use partitioning, a lot of indexes, etc?

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
to how fast it can be. Whether it's not locking individual partitions, or
not locking some indexes, ... I don't know.

regards

--
Tomas Vondra

#17Jakub Wartak
Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#16)
Re: scalability bottlenecks with (many) partitions (and more)

Hi Tomas!

On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:

On 9/3/24 17:06, Robert Haas wrote:

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table, but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

Agreed.

[..]

So I think I'm OK with just tying this to max_locks_per_transaction.

If that matters, then consider that the SLRU configurability effort
added 7 GUCs (with 3 scaling up based on shared_buffers) just to give
high-end users some relief, so 1 new GUC here shouldn't be such a big
deal. We could add a note to the LWLock/lock_manager wait event docs
recommending known-to-be-good values from this $thread (or ask the
user to benchmark it himself).

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often it fits into the fast-path array, and how often we had to
promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be,
but it will tell you whether it's big enough.

True, but that applies to all "cache hit ratio" metrics (like for our
shared buffers). It'd be great to have something better, enough to tell
you how large the cache needs to be. But we don't :-(

My $0.02 cents: the originating case that triggered those patches,
actually started with LWLock/lock_manager waits being the top#1. The
operator can cross check (join) that with a group by pg_locks.fastpath
(='f'), count(*). So, IMHO we have good observability in this case
(rare thing to say!)

I wonder if we should be looking at further improvements in the lock
manager of some kind. [..]

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful for the complete workload).

For example, if we did what you propose, that might help when very few
transactions need a lot of locks. I don't mind saving memory in that
case, ofc. but is it a problem if those rare cases are a bit slower?
Shouldn't we focus more on cases where many locks are common? Because
people are simply going to use partitioning, a lot of indexes, etc?

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
to how fast it can be.

Please help me understand: are you both discussing potential far-future
improvements instead of this one? My question is really: is the patchset
good enough, or are you considering some other new effort instead?

BTW some other random questions:
Q1. I've been lurking into
https://github.com/tvondra/pg-lock-scalability-results and those
shouldn't be used anymore for further discussions, as they contained
earlier patches (including
0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were
replaced by benchmark data in this $thread, right?
Q2. Earlier attempts did contain a mempool patch to get those nice
numbers (or was that jemalloc or glibc tuning?). So were those recent
results in [1] still collected with 0003, or have you switched
completely to glibc/jemalloc tuning?

-J.

[1]: /messages/by-id/b8c43eda-0c3f-4cb4-809b-841fa5c40ada@vondra.me

#18Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Jakub Wartak (#17)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/4/24 11:29, Jakub Wartak wrote:

Hi Tomas!

On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra <tomas@vondra.me> wrote:

On 9/3/24 17:06, Robert Haas wrote:

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table, but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

Agreed.

[..]

So I think I'm OK with just tying this to max_locks_per_transaction.

If that matters then the SLRU configurability effort added 7 GUCs
(with 3 scaling up based on shared_buffers) just to give high-end
users some relief, so here 1 new shouldn't be that such a deal. We
could add to the LWLock/lock_manager wait event docs to recommend just
using known-to-be-good certain values from this $thread (or ask the
user to benchmark it himself).

TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the
right thing for the SLRU thread, I don't know - I haven't been following
that very closely. But my impression is that we often add a GUC when
we're not quite sure how to pick a good value. So we just shift the
responsibility to someone else, who however also doesn't know.

I'd very much prefer not to do that here. Of course, it's challenging
because we can't easily resize these arrays, so even if we had some nice
heuristics to calculate the "optimal" number of fast-path slots, what
would we do with it ...

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often it fits into the fast-path array, and how often we had to
promote it to the shared lock table would be enough, no?

Yeah, probably. I mean, that won't tell you how big it needs to be,
but it will tell you whether it's big enough.

True, but that applies to all "cache hit ratio" metrics (like for our
shared buffers). It'd be great to have something better, enough to tell
you how large the cache needs to be. But we don't :-(

My $0.02 cents: the originating case that triggered those patches,
actually started with LWLock/lock_manager waits being the top#1. The
operator can cross check (join) that with a group by pg_locks.fastpath
(='f'), count(*). So, IMHO we have good observability in this case
(rare thing to say!)

That's a good point. So if you had to give some instructions to users
what to measure / monitor, and how to adjust the GUC based on that, what
would your instructions be?

I wonder if we should be looking at further improvements in the lock
manager of some kind. [..]

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful for the complete workload).

For example, if we did what you propose, that might help when very few
transactions need a lot of locks. I don't mind saving memory in that
case, ofc. but is it a problem if those rare cases are a bit slower?
Shouldn't we focus more on cases where many locks are common? Because
people are simply going to use partitioning, a lot of indexes, etc?

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
to how fast it can be.

Please help me understand: so are You both discussing potential far
future further improvements instead of this one ? My question is
really about: is the patchset good enough or are you considering some
other new effort instead?

I think it was mostly a brainstorming about alternative / additional
improvements in locking. The proposed patch does not change the locking
in any fundamental way, it merely optimizes one piece - we still acquire
exactly the same set of locks, exactly the same way.

AFAICS there's an agreement the current approach has limits, and with
the growing number of partitions we're hitting them already. That may
need rethinking the fundamental approach, but I think that should not
block improvements to the current approach.

Not to mention there's no proposal for such "fundamental rework" yet.

BTW some other random questions:
Q1. I've been lurking into
https://github.com/tvondra/pg-lock-scalability-results and those
shouldn't be used anymore for further discussions, as they contained
earlier patches (including
0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and they were
replaced by benchmark data in this $thread, right?

The github results are still valid - I only shared them 3 days ago. They
test both the mempool and glibc tuning, to assess (and compare) the
benefits of those, but why would that make them obsolete?

By "results in this thread" I suppose you mean the couple numbers I
shared on September 2? Those were just very limited benchmarks to asses
if making the arrays variable-length (based on GUC) would make things
slower. And it doesn't, so the "full" github results still apply.

Q2. Earlier attempts did contain a mempool patch to get those nice
numbers (or was that jemalloc or glibc tuning). So were those recent
results in [1] collected with still 0003 or you have switched
completely to glibc/jemalloc tuning?

The results pushed to github are all with glibc, and test four cases:

a) mempool patch not applied, no glibc tuning
b) mempool patch applied, no glibc tuning
c) mempool patch not applied, glibc tuning
d) mempool patch applied, glibc tuning

These are the 4 "column groups" in some of the pivot tables, to allow
comparing those cases. My interpretation of the results is:

1) The mempool / glibc tuning has significant benefits, at least for
some workloads (where the locking patch alone doesn't help much).

2) There's very little difference between the mempool / glibc tuning.
The mempool does seem to have a small advantage.

3) The mempool / glibc tuning is irrelevant for non-glibc systems (e.g.
for FreeBSD which I think uses jemalloc or something like that).

I think the mempool might be interesting and useful for other reasons
(e.g. I initially wrote it to enforce a per-backend memory limit), but
you can get mostly the same caching benefits by tuning the glibc parameters.

So I'm focusing on the locking stuff.

regards

--
Tomas Vondra

#19Matthias van de Meent
Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#16)
Re: scalability bottlenecks with (many) partitions (and more)

On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote:

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct).

We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
used[^1]. Does that align with your usage numbers, or are they
significantly larger?

At least based on a quick experiment. (Seems a bit high, right?).

Yeah, that does seem high, thanks for nerd-sniping me.

The 152 bytes of LOCK are mostly due to a combination of two
MAX_LOCKMODES-sized int[]s that are used to keep track of the number
of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
keep using int[]s, but that doesn't mean we can't improve this size:

ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
NoLock, which is never used or counted in these shared structures, and
the max lock mode supported by any of the supported lock methods is
AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
reducing size of the LOCK struct by 16 bytes.

If some struct- and field packing is OK, then we could further reduce
the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
type from int to int16 (we only use the first MaxLockMode (8) + 1
bits), and then storing the grant/waitMask fields (now 4 bytes total)
in the padding that's present at the end of the waitProcs struct. This
would depend on dclist not writing in its padding space, but I
couldn't find any user that did so, and most critically dclist_init
doesn't scribble in the padding with memset.

If these are both implemented, it would save 24 bytes, reducing the
struct to 128 bytes. :) [^2]
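
To put rough numbers on that, here's a standalone mock of the two
layouts - the stand-ins only mimic the sizes of LOCKTAG, dlist_head and
dclist_head, so treat the exact padding as an approximation:

#include <stdio.h>
#include <stdint.h>

typedef struct { uint8_t b[16]; } MockLockTag;                  /* LOCKTAG: 16 bytes */
typedef struct { void *head; void *tail; } MockDList;           /* dlist_head: 16 bytes */
typedef struct { MockDList dl; uint32_t count; } MockDCList;    /* dclist_head: 24 bytes */

/* roughly the current LOCK layout: MAX_LOCKMODES = 10, LOCKMASK = int */
typedef struct
{
    MockLockTag tag;
    int32_t     grantMask;
    int32_t     waitMask;
    MockDList   procLocks;
    MockDCList  waitProcs;
    int32_t     requested[10];
    int32_t     nRequested;
    int32_t     granted[10];
    int32_t     nGranted;
} MockLockCurrent;

/* the proposed packing: MAX_LOCKMODES = 8, int16 masks in the dclist padding */
typedef struct
{
    MockLockTag tag;
    MockDList   procLocks;
    MockDList   waitProcsDl;
    uint32_t    waitProcsCount;
    int16_t     grantMask;          /* tucked into the former padding */
    int16_t     waitMask;
    int32_t     requested[8];
    int32_t     nRequested;
    int32_t     granted[8];
    int32_t     nGranted;
} MockLockPacked;

int main(void)
{
    printf("current-ish LOCK: %zu bytes\n", sizeof(MockLockCurrent));   /* 152 */
    printf("packed LOCK:      %zu bytes\n", sizeof(MockLockPacked));    /* 128 */
    return 0;
}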

I also checked PROCLOCK: If it is worth further packing the struct, we
should probably look at whether it's worth replacing the PGPROC* typed
fields with ProcNumber -based ones, potentially in both PROCLOCK and
PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
these fields being replaced with ProcNumber would allow a reduction in
size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
smallest I could get it to without ignoring type alignments.

Further shmem savings can be achieved by reducing the "10% safety
margin" added at the end of LockShmemSize, as I'm fairly sure the
memory used in shared hashmaps doesn't exceed the estimated amount,
and if it did then we should probably fix that part, rather than
requesting that (up to) 10% overhead here.

Altogether that'd save 40 bytes/lock entry on size, and ~35 bytes/lock
on the "safety margin", for a saving of (up to) 19% of our current
allocation. I'm not sure whether these tricks would benefit performance
or even hurt it, apart from smaller structs usually fitting better in
CPU caches.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^1] NLOCKENTS() benefits from being a power of 2, or slightly below
one, as it's rounded up to a power of 2 when dynahash decides its
number of buckets to allocate.
[^2] Sadly this 2-cachelines alignment is lost due to dynahash's
HASHELEMENT prefix of elements. :(

#20David Rowley
David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#15)
Re: scalability bottlenecks with (many) partitions (and more)

On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

I think driving the array size from max_locks_per_transaction is a
good idea (rounded up to the next multiple of 16?). If someone comes
along one day and shows us a compelling case where some backend needs
more than its fair share of locks and performance is bad because of
that, then maybe we can consider adding a GUC then. Certainly, it's
much easier to add a GUC later if someone convinces us that it's a
good idea than it is to add it now and try to take it away in the
future if we realise it's not useful enough to keep.

David

#21Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Matthias van de Meent (#19)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/4/24 16:25, Matthias van de Meent wrote:

On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote:

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct).

We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
used[^1]. Does that align with your usage numbers, or are they
significantly larger?

I see more like ~470B per lock. If I patch CalculateShmemSize to log the
shmem allocated, I get this:

max_connections=100 max_locks_per_transaction=1000 => 194264001
max_connections=100 max_locks_per_transaction=2000 => 241756967

and (((241756967-194264001)/100/1000)) = 474

Could be alignment of structs or something, not sure.

At least based on a quick experiment. (Seems a bit high, right?).

Yeah, that does seem high, thanks for nerd-sniping me.

The 152 bytes of LOCK are mostly due to a combination of two
MAX_LOCKMODES-sized int[]s that are used to keep track of the number
of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
keep using int[]s, but that doesn't mean we can't improve this size:

ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
NoLock, which is never used or counted in these shared structures, and
the max lock mode supported by any of the supported lock methods is
AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
reducing size of the LOCK struct by 16 bytes.

If some struct- and field packing is OK, then we could further reduce
the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
type from int to int16 (we only use the first MaxLockMode (8) + 1
bits), and then storing the grant/waitMask fields (now 4 bytes total)
in the padding that's present at the end of the waitProcs struct. This
would depend on dclist not writing in its padding space, but I
couldn't find any user that did so, and most critically dclist_init
doesn't scribble in the padding with memset.

If these are both implemented, it would save 24 bytes, reducing the
struct to 128 bytes. :) [^2]

I also checked PROCLOCK: If it is worth further packing the struct, we
should probably look at whether it's worth replacing the PGPROC* typed
fields with ProcNumber -based ones, potentially in both PROCLOCK and
PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
these fields being replaced with ProcNumber would allow a reduction in
size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
smallest I could get it to without ignoring type alignments.

Further shmem savings can be achieved by reducing the "10% safety
margin" added at the end of LockShmemSize, as I'm fairly sure the
memory used in shared hashmaps doesn't exceed the estimated amount,
and if it did then we should probably fix that part, rather than
requesting that (up to) 10% overhead here.

Altogether that'd save 40 bytes/lock entry on size, and ~35
bytes/lock on "safety margin", for a saving of (up to) 19% of our
current allocation. I'm not sure if these tricks would benefit with
performance or even be a demerit, apart from smaller structs usually
being better at fitting better in CPU caches.

Not sure either, but it seems worth exploring. If you do an experimental
patch for the LOCK size reduction, I can get some numbers.

I'm not sure about the safety margins. 10% sure seems like quite a bit
of memory (it might not have been in the past, but as instances keep
growing, that has probably changed).

regards

--
Tomas Vondra

#22Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: David Rowley (#20)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/4/24 17:12, David Rowley wrote:

On Wed, 4 Sept 2024 at 03:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas@vondra.me> wrote:

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

All things being equal, I would prefer not to add another GUC for
this, but we might need it.

I think driving the array size from max_locks_per_transaction is a
good idea (rounded up to the next multiple of 16?).

Maybe, although I was thinking we'd just use the regular doubling, to
get nice "round" numbers. It will likely overshoot a little bit (unless
people set the GUC to exactly 2^N), but I don't think that's a problem.
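
For illustration, the doubling could be as simple as this sketch (the
16-slot group size is what we've been assuming in this thread, the cap
and the names are made up):

#include <stdio.h>

#define FP_LOCK_SLOTS_PER_GROUP 16
#define FP_LOCK_GROUPS_MAX      1024    /* illustrative cap, not a real constant */

/* pick the number of 16-slot fast-path groups for a given GUC value */
static int
fast_path_groups(int max_locks_per_transaction)
{
    int     groups = 1;

    while (groups < FP_LOCK_GROUPS_MAX &&
           groups * FP_LOCK_SLOTS_PER_GROUP < max_locks_per_transaction)
        groups *= 2;

    return groups;
}

int main(void)
{
    /* 64 -> 4 groups (64 slots), 1000 -> 64 groups (1024 slots, overshooting a bit) */
    printf("%d %d %d\n",
           fast_path_groups(64), fast_path_groups(1000), fast_path_groups(1024));
    return 0;
}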

If someone comes along one day and shows us a compelling case where
some backend needs more than its fair share of locks and performance
is bad because of that, then maybe we can consider adding a GUC then.
Certainly, it's much easier to add a GUC later if someone convinces
us that it's a good idea than it is to add it now and try to take it
away in the future if we realise it's not useful enough to keep.

Yeah, I agree with this.

regards

--
Tomas Vondra

#23Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#16)
Re: scalability bottlenecks with (many) partitions (and more)

On Tue, Sep 3, 2024 at 12:19 PM Tomas Vondra <tomas@vondra.me> wrote:

Doing some worst case math, suppose somebody has max_connections=1000
(which is near the upper limit of what I'd consider a sane setting)
and max_locks_per_transaction=10000 (ditto). The product is 10
million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing
up 15GB of RAM when you could have chewed up only 0.5GB certainly
isn't too great. On the other hand, those values are kind of pushing
the limits of what is actually sane. If you imagine
max_locks_per_transaction=2000 rather than
max_locks_per_transaction=10000, then it's only 3GB and that's
hopefully not a lot on the hopefully-giant machine where you're
running this.

Yeah, although I don't quite follow the math. With 1000/10000 settings,
why would that eat 15GB of RAM? I mean, that's 1.5GB, right?

Oh, right.

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct). At least based
on a quick experiment. (Seems a bit high, right?).

Hmm, yes, that's unpleasant.

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful for the complete workload).

For example, if we did what you propose, that might help when very few
transactions need a lot of locks. I don't mind saving memory in that
case, ofc. but is it a problem if those rare cases are a bit slower?
Shouldn't we focus more on cases where many locks are common? Because
people are simply going to use partitioning, a lot of indexes, etc?

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
to how fast it can be. Whether it's not locking individual partitions, or
not locking some indexes, ... I don't know.

I don't know, either. We don't have to decide right now; it's just
something to keep in mind.

--
Robert Haas
EDB: http://www.enterprisedb.com

#24Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Robert Haas (#23)
4 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

Here's a bit more polished version of this patch series. I only propose
0001 and 0002 for eventual commit, the two other bits are just stuff to
help with benchmarking etc.

0001
----
increases the size of the arrays, but uses hard-coded number of groups
(64, so 1024 locks) and leaves everything in PGPROC

0002
----
Allocates that separately from PGPROC, and sets the number based on
max_locks_per_transaction

I think 0001 and 0002 should be in fairly good shape, IMO. There's a
couple cosmetic things that bother me (e.g. the way it Asserts after
each FAST_PATH_LOCK_REL_GROUP seems distracting).

But other than that I think it's fine, so a review / opinions would be
very welcome.

0003
----
Adds a separate GUC to make benchmarking easier (without the impact of
changing the size of the lock table).

I think the agreement is to not have a new GUC, unless it turns out to
be necessary in the future. So 0003 was just to make benchmarking a bit
easier.

0004
----
This was a quick attempt to track the fraction of fast-path locks, and
adding the infrastructure is a mostly mechanical thing. But it turns out
it's not quite trivial to track why a lock did not use the fast path. It
might have been because it wouldn't fit, or maybe it's not eligible, or
maybe there's a stronger lock. It's not obvious how to count these to
help with evaluating the number of fast-path slots.

regards

--
Tomas Vondra

Attachments:

v20240905-0001-Increase-the-number-of-fast-path-lock-slot.patchtext/x-patch; charset=UTF-8; name=v20240905-0001-Increase-the-number-of-fast-path-lock-slot.patch
v20240905-0002-Size-fast-path-slots-using-max_locks_per_t.patchtext/x-patch; charset=UTF-8; name=v20240905-0002-Size-fast-path-slots-using-max_locks_per_t.patch
v20240905-0003-separate-guc-to-allow-benchmarking.patchtext/x-patch; charset=UTF-8; name=v20240905-0003-separate-guc-to-allow-benchmarking.patch
v20240905-0004-lock-stats.patchtext/x-patch; charset=UTF-8; name=v20240905-0004-lock-stats.patch
#25Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#18)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/4/24 13:15, Tomas Vondra wrote:

On 9/4/24 11:29, Jakub Wartak wrote:

Hi Tomas!

...

My $0.02 cents: the originating case that triggered those patches,
actually started with LWLock/lock_manager waits being the top#1. The
operator can cross check (join) that with a group by pg_locks.fastpath
(='f'), count(*). So, IMHO we have good observability in this case
(rare thing to say!)

That's a good point. So if you had to give some instructions to users
what to measure / monitor, and how to adjust the GUC based on that, what
would your instructions be?

After thinking about this a bit more, I'm actually wondering if this
source of information is sufficient. Firstly, it's just a snapshot of a
single instance, and it's not quite trivial to get a summary for a
longer time period (people would have to sample it often enough, etc.).
Doable, but much less convenient than cumulative counters.

But for the sampling, doesn't this produce skewed data? Imagine you have
a workload with very short queries (which is when fast-path matters), so
you're likely to see the backend while it's obtaining the locks. If the
fast-path locks are much faster to acquire (kinda the whole point), we're
more likely to see the backend while it's obtaining the regular locks.

Let's say the backend needs 1000 locks, and 500 of those fit into the
fast-path array. We're likely to see the 500 fast-path locks already
acquired, and a random fraction of the 500 non-fast-path locks. So in
the end you'll see backends needing 500 fast-path locks and 250 regular
locks. That doesn't seem terrible, but I guess the difference can be
made even larger.
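
Just to illustrate that intuition with a toy model (the relative
acquisition costs are entirely made up, and it assumes the sample always
lands during lock acquisition):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int       nfast = 500, nreg = 500;
    const double    t_fast = 1.0, t_reg = 10.0;     /* relative costs, made up */
    const double    total = nfast * t_fast + nreg * t_reg;
    const int       samples = 100000;
    double          sum_fast = 0.0, sum_reg = 0.0;

    srand(42);
    for (int i = 0; i < samples; i++)
    {
        /* sample a uniformly random moment during lock acquisition */
        double      t = (double) rand() / RAND_MAX * total;

        if (t < nfast * t_fast)
        {
            /* still filling the fast-path array */
            sum_fast += (int) (t / t_fast);
        }
        else
        {
            /* fast-path array full, acquiring regular locks */
            sum_fast += nfast;
            sum_reg += (int) ((t - nfast * t_fast) / t_reg);
        }
    }

    /* prints roughly "477 fast-path vs 227 regular", i.e. a ~2:1 skew */
    printf("avg visible: %.0f fast-path vs %.0f regular locks\n",
           sum_fast / samples, sum_reg / samples);
    return 0;
}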

regards

--
Tomas Vondra

#26Jakub Wartak
Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#25)
Re: scalability bottlenecks with (many) partitions (and more)

On Thu, Sep 5, 2024 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote:

My $0.02 cents: the originating case that triggered those patches,
actually started with LWLock/lock_manager waits being the top#1. The
operator can cross check (join) that with a group by pg_locks.fastpath
(='f'), count(*). So, IMHO we have good observability in this case
(rare thing to say!)

That's a good point. So if you had to give some instructions to users
what to measure / monitor, and how to adjust the GUC based on that, what
would your instructions be?

After thinking about this a bit more, I'm actually wondering if this is
source of information is sufficient. Firstly, it's just a snapshot of a
single instance, and it's not quite trivial to get some summary for
longer time period (people would have to sample it often enough, etc.).
Doable, but much less convenient than the cumulative counters.

OK, so answering previous question:

Probably just monitor pg_stat_activity (grouping on wait events,
count(*)) together with pg_locks grouped per-pid and by fastpath. Even
simple observations with \watch 0.1 are good enough to confirm/deny the
root cause in practice, even for short bursts while it is happening.
While deploying monitoring for a longer time (with, say, a 1s sample),
you would sooner or later get the __high water mark__, and could allow
up to that many fastpath slots as a starting point, based on the locks
occurring for the affected PIDs (or double the amount).

But for the sampling, doesn't this produce skewed data? Imagine you have
a workload with very short queries (which is when fast-path matters), so
you're likely to see the backend while it's obtaining the locks. If the
fast-path locks take much faster acquire (kinda the whole point), we're
more likely to see the backend while it's obtaining the regular locks.

Let's say the backend needs 1000 locks, and 500 of those fit into the
fast-path array. We're likely to see the 500 fast-path locks already
acquired, and a random fraction of the 500 non-fast-path locks. So in
the end you'll se backends needing 500 fast-path locks and 250 regular
locks. That doesn't seem terrible, but I guess the difference can be
made even larger.

... it doesn't need to be perfect data to act on, right? We may just
need the information that it is happening (well, we do). Maybe it's too
pragmatic a point of view, but wasting some bits of memory on this,
while still being allowed to control how much it allocates in the end,
is a much better situation than today, where without any control we are
wasting crazy CPU time on all those futex() syscalls and context
switches.

Another angle is that if you see the SQL causing it, it is most often
going to be attributed to partitioning and people ending up accessing
way too many partitions (thousands) without proper partition pruning -
sometimes it even triggers re-partitioning of the said tables. So
maybe the realistic "fastpath sizing" should assume something that
supports:
a) usual number of tables in JOINs (just few of them are fine like today) -> ok
b) interval 1 month partitions for let's say 5 years (12*5 = 60),
joined to some other table like that gives like what, max 120? -> so
if you have users doing SELECT * FROM such_table , they will already
have set the max_locks_per_xact probably to something higher.
c) HASH partitioning up to VCPU-that-are-in-the-wild count? (say 64 or
128? so it sounds same as above?)
d) probably we should not care here at all if somebody wants daily
partitioning across years with HASH (sub)partitions without partition
pruning -> it has nothing to do with being "fast" anyway

Judging from the current reports, people have configured
max_locks_per_xact like this: ~75% have it at the default (64), 10% have
1024, 5% have 128, and the rest (~10%) have somewhere between 100 and a
few thousand, with extreme one-offs @ 25k (a wild misconfiguration
judging from the other specs).

BTW: you probably need to register this $thread into CF for others to
see too (it's not there?)

-J.

#27Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#24)
5 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

I've spent quite a bit of time trying to identify cases where having
more fast-path lock slots could be harmful, without any luck. I started
with the EPYC machine I used for the earlier tests, but found nothing,
except for a couple of cases unrelated to this patch, because it affects
even cases without the patch applied at all. More like random noise or
maybe some issue with the VM (or differences from the VM used earlier). I
pushed the results to github [1] anyway, if anyone wants to look.

So I switched to my smaller machines, and ran a simple test on master,
with the hard-coded arrays, and with the arrays moved out of PGPROC (and
sized per max_locks_per_transaction).

I was looking for regressions, so I wanted to test a case that can't
benefit from fast-path locking, while paying the costs. So I decided to
do pgbench -S with 4 partitions, because that fits into the 16 slots we
had before, and scale 1 to keep everything in memory. And then did a
couple read-only runs, first with 64 locks/transaction (default), then
with 1024 locks/transaction.

Attached is a shell script I used to collect this - it creates and
removes clusters, so be careful. Should be fairly obvious what it tests
and how.

The results for max_locks_per_transaction=64 look like this (the numbers
are throughput):

machine mode clients master built-in with-guc
---------------------------------------------------------
i5 prepared 1 14970 14991 14981
4 51638 51615 51388
simple 1 14042 14136 14008
4 48705 48572 48457
------------------------------------------------------
xeon prepared 1 13213 13330 13170
4 49280 49191 49263
16 151413 152268 151560
simple 1 12250 12291 12316
4 45910 46148 45843
16 141774 142165 142310

And compared to master

machine mode clients built-in with-guc
-------------------------------------------------
i5 prepared 1 100.14% 100.08%
4 99.95% 99.51%
simple 1 100.67% 99.76%
4 99.73% 99.49%
----------------------------------------------
xeon prepared 1 100.89% 99.68%
4 99.82% 99.97%
16 100.56% 100.10%
simple 1 100.34% 100.54%
4 100.52% 99.85%
16 100.28% 100.38%

So, no difference whatsoever - it's +/- 0.5%, well within random noise.
And with max_locks_per_transaction=1024 the story is exactly the same:

machine mode clients master built-in with-guc
---------------------------------------------------------
i5 prepared 1 15000 14928 14948
4 51498 51351 51504
simple 1 14124 14092 14065
4 48531 48517 48351
xeon prepared 1 13384 13325 13290
4 49257 49309 49345
16 151668 151940 152201
simple 1 12357 12351 12363
4 46039 46126 46201
16 141851 142402 142427

machine mode clients built-in with-guc
-------------------------------------------------
i5 prepared 1 99.52% 99.65%
4 99.71% 100.01%
simple 1 99.77% 99.58%
4 99.97% 99.63%
xeon prepared 1 99.56% 99.30%
4 100.11% 100.18%
16 100.18% 100.35%
simple 1 99.96% 100.05%
4 100.19% 100.35%
16 100.39% 100.41%

With max_locks_per_transaction=1024, it's fair to expect the fast-path
locking to be quite beneficial. Of course, it's possible the GUC is set
this high because of some rare use case (say, to run pg_dump, which needs
to lock everything).

I did look at the docs to see if anything needs updating, but I don't
think so. The SGML docs only talk about fast-path locking at a fairly
high level, not about how many slots we have etc. Same for
src/backend/storage/lmgr/README, which focuses on the correctness of
fast-path locking, and that's not changed by this patch.

I also cleaned up (removed) some of the Asserts checking that we got a
valid group / slot index. I don't think this really helped in practice,
once I added asserts to the macros.

Anyway, at this point I'm quite happy with this improvement. I didn't
have any clear plan when to commit this, but I'm considering doing so
sometime next week, unless someone objects or asks for some additional
benchmarks etc.

One thing I'm not quite sure about yet is whether to commit this as a
single change, or the way the attached patches do that, with the first
patch keeping the larger array in PGPROC and the second patch making it
separate and sized on max_locks_per_transaction ... Opinions?

regards

[1]: https://github.com/tvondra/pg-lock-scalability-results

--
Tomas Vondra

Attachments:

v20240912-0001-Increase-the-number-of-fast-path-lock-slot.patchtext/x-patch; charset=UTF-8; name=v20240912-0001-Increase-the-number-of-fast-path-lock-slot.patch
v20240912-0002-Set-fast-path-slots-using-max_locks_per_tr.patchtext/x-patch; charset=UTF-8; name=v20240912-0002-Set-fast-path-slots-using-max_locks_per_tr.patch
results-1024.csvtext/csv; charset=UTF-8; name=results-1024.csv
run-lock-test.shapplication/x-shellscript; name=run-lock-test.sh
results-64.csvtext/csv; charset=UTF-8; name=results-64.csv
#28Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#27)
2 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

Turns out there was a bug in EXEC_BACKEND mode, causing failures on the
Windows machine in CI. AFAIK the reason is pretty simple - the backends
don't see the number of fast-path groups postmaster calculated from
max_locks_per_transaction.

Fixed that by calculating it again in AttachSharedMemoryStructs, which
seems to have done the trick. With this the CI builds pass just fine,
but I'm not sure if EXEC_BACKEND may have some other issues with the
PGPROC changes. Could it happen that the shared memory gets mapped
differently, in which case the pointers might need to be adjusted?

regards

--
Tomas Vondra

Attachments:

v20240913-0002-Set-fast-path-slots-using-max_locks_per_tr.patchtext/x-patch; charset=UTF-8; name=v20240913-0002-Set-fast-path-slots-using-max_locks_per_tr.patch
v20240913-0001-Increase-the-number-of-fast-path-lock-slot.patchtext/x-patch; charset=UTF-8; name=v20240913-0001-Increase-the-number-of-fast-path-lock-slot.patch
#29Jakub Wartak
Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#28)
Re: scalability bottlenecks with (many) partitions (and more)

On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:

[..]

Anyway, at this point I'm quite happy with this improvement. I didn't
have any clear plan when to commit this, but I'm considering doing so
sometime next week, unless someone objects or asks for some additional
benchmarks etc.

Thank you very much for working on this :)

The only thing that comes to my mind is that we could blow up L2
caches. Fun fact: if we are growing PGPROC by 6.3x, that's going to
be like one or two 2MB huge pages more @ common max_connections=1000
x86_64 (830kB -> ~5.1MB), and indeed:

# without patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
177

# with patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
178

So playing Devil's advocate, the worst situation that could possibly
hurt (?) could be:
* memory size of PGPROC working set >> L2_cache (thus very high
max_connections),
* insane number of working sessions on CPU (sessions >> VCPU) - sadly
happens to some,
* those sessions wouldn't have to be competing for the same Oids -
just fetching this new big fpLockBits[] structure - so probing a lot
for lots of Oids, but *NOT* having to use futex() syscall [so not that
syscall price]
* no huge pages (to cause dTLB misses)

then maybe(?) one could observe further degradation in the dTLB-miss
perf-stat counters under some microbenchmark, but measuring that
requires isolated physical hardware. Maybe that would actually be noise
from the overhead of the context switches themselves. Just trying to
think out loud about what a big PGPROC could cause here. But this is
already an unhealthy and non-steady state of the system, so IMHO we are
good, unless someone comes up with a better (more evil) idea.

I did look at docs if anything needs updating, but I don't think so.
The SGML docs only talk about fast-path locking at fairly high level,
not about how many we have etc.

Well the only thing I could think of was to add to the
doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
is also used as advisory for the number of groups used in
lockmanager's fast-path implementation" (that is, without going into
further discussion, as even pg_locks discussion
doc/src/sgml/system-views.sgml simply uses that term).

-J.

#30Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Jakub Wartak (#29)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/16/24 15:11, Jakub Wartak wrote:

On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:

[..]

Anyway, at this point I'm quite happy with this improvement. I didn't
have any clear plan when to commit this, but I'm considering doing so
sometime next week, unless someone objects or asks for some additional
benchmarks etc.

Thank you very much for working on this :)

The only concern that comes to my mind is that we could blow up L2
caches. Fun fact: if we are growing PGPROC by ~6.3x, that's going to
be like one or two extra 2MB huge pages at the common max_connections=1000
on x86_64 (830kB -> ~5.1MB), and indeed:

# without patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
177

# with patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
178

So playing Devil's advocate, the worst situation that could possibly
hurt (?) could be:
* memory size of PGPROC working set >> L2_cache (thus very high
max_connections),
* insane number of working sessions on CPU (sessions >> VCPU) - sadly
happens to some,
* those sessions wouldn't have to be competing for the same Oids -
just fetching this new big fpLockBits[] structure - so probing a lot
for lots of Oids, but *NOT* having to use futex() syscall [so not that
syscall price]
* no huge pages (to cause dTLB misses)

then maybe(?) one could observe further degradation in the dTLB-miss
perf-stat counters under some microbenchmark, but measuring that
requires isolated physical hardware. Maybe that would actually be noise
from the overhead of the context switches themselves. Just trying to
think out loud about what a big PGPROC could cause here. But this is
already an unhealthy and non-steady state of the system, so IMHO we are
good, unless someone comes up with a better (more evil) idea.

I've been thinking about such cases too, but I don't think it can really
happen in practice, because:

- How likely is it that the sessions will need a lot of OIDs, but not
the same ones? Also, why would it matter that the OIDs are not the same?
I don't think it matters unless one of the sessions needs an exclusive
lock, at which point the optimization doesn't really matter.

- If having more fast-path slots means it doesn't fit into L2 cache,
would we fit into L2 without it? I don't think so - if there really are
that many locks, we'd have to add those into the shared lock table, and
there's a lot of extra stuff to keep in memory (relcaches, ...).

This is pretty much one of the cases I focused on in my benchmarking,
and I'm yet to see any regression.

I did look at docs if anything needs updating, but I don't think so.
The SGML docs only talk about fast-path locking at fairly high level,
not about how many we have etc.

Well the only thing I could think of was to add to the
doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
is also used as advisory for the number of groups used in
lockmanager's fast-path implementation" (that is, without going into
further discussion, as even pg_locks discussion
doc/src/sgml/system-views.sgml simply uses that term).

Thanks, I'll consider mentioning this in max_locks_per_transaction.
Also, I think there's a place calculating the amount of per-connection
memory, so maybe that needs to be updated too.

regards

--
Tomas Vondra

#31Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#30)
3 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

I've spent the last couple days doing all kinds of experiments trying to
find regressions caused by the patch, but no success. Which is good.

Attached is a script that just does a simple pgbench on a tiny table,
with no or very few partitions. The idea is that this will fit into
shared buffers (thus no I/O), and will fit into the 16 fast-path slots
we have now. It can't benefit from the patch - it can only get worse, if
having more fast-path slots hurts.

I ran this on my two machines, and in both cases the results are +/- 1%
from the master for all combinations of parameters (clients, mode,
number of partitions, ..). In most cases it's actually much closer,
particularly with the default max_locks_per_transaction value.

For higher values of the GUC, I think it's fine too - the differences
are perhaps a bit larger (~1.5%), but it's clearly hardware specific (i5
gets a bit faster, xeon a bit slower). And I'm pretty sure people who
increased that GUC value likely did that because of locking many rels,
and so will actually benefit from the increased fast-path capacity.

At this point I'm pretty happy and confident the patch is fine. Unless
someone objects, I'll get it committed after going over it one more
time. I decided to commit it as a single change - it would be weird
to have an intermediate state with larger arrays in PGPROC, when that's
not something we actually want.

I still haven't found any places in the docs that should mention this,
except for the bit about the max_locks_per_transaction GUC. There's nothing
in SGML mentioning details of fast-path locking. I thought we had some
formula to calculate per-connection memory, but I think I confused that
with the shmem formulas we had in "Managing Kernel Resources". But even
that no longer mentions max_connections in master.

regards

--
Tomas Vondra

Attachments:

lock-test.ods (application/vnd.oasis.opendocument.spreadsheet)
lock-test.sh (application/x-shellscript)
lock-test.pdf (application/pdf)
#32Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#31)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing. All changes since the version shared on
September 13 are only cosmetic - renaming a macro to keep it consistent
with the other ones, clarifying a couple comments etc. Nothing major.

I ended up squashing the two parts into a single commit. I thought about
keeping the two steps, but it seemed pointless - the first part inflated
the PGPROC struct, which I didn't want to commit, even if only as an
intermediate WIP state.

So far the buildfarm hasn't blown up, so let's hope it stays that way.

I just realized there's no CF entry for this - sorry about that :-( I
started the thread a year ago to discuss an experimental patch, and it
never made it to CFA. But there was a discussion spanning a year, so
hopefully that's enough.

regards

--
Tomas Vondra

#33Ants Aasma
Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#32)
Re: scalability bottlenecks with (many) partitions (and more)

On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing. All changes since the version shared on
September 13 are only cosmetic - renaming a macro to keep it consistent
with the other ones, clarifying a couple comments etc. Nothing major.

Great work on this. I have seen multiple customers hitting fast-path
capacity related LockManager contention. They will certainly be glad
to have a fix available when they eventually upgrade. Regretfully I
did not find the time to participate in this discussion during
development. But I did have some thoughts that I wanted to unload to
the list, not as a criticism, but in case it turns out follow up work
is needed.

Driving the array sizing from max_locks_per_transaction seems like a
good idea. The one major difference from the lock table is that while
the lock table is partitioned dynamically between backends, the fast
path array has a static size per backend. One case where that
distinction matters is when only a fraction of backends try to lock
large numbers of relations. This fraction will still fall back to main
lock tables, but at least the contention should be limited by virtue
of not having too many of those backends. The other case is when
max_connections is much higher than the number of backends actually
used. Then backends may be consuming well over
max_locks_per_transaction without running into lock table capacity
issues.

In both cases users will have the simple workaround of just increasing
the max_locks_per_transaction setting. Still, I'm sure they would be
happier if things just worked without any tuning. So I tried to figure
out some scheme to get dynamic allocation of fast path locks.

The best data structure I came up with was to have a shared fast path
lock array. Still partitioned as a 16-way associative cache, but
indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into
the high byte of BackendId thanks to MAX_BACKENDS. Locking could be
handled by one lock per way, or at least on cursory glance it
shouldn't be too difficult to convert the whole fast path acquisition
to be lock free.
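
To sketch roughly what I have in mind - purely illustrative, standalone
code, not written against the real PGPROC/lock structures, and with an
arbitrary hash:

#include <stdio.h>
#include <stdint.h>

/*
 * One entry of a hypothetical shared fast-path array. MAX_BACKENDS fits
 * in well under 24 bits, so the per-entry lock bits could live in the
 * high byte of the same 32-bit word as the backend number.
 */
typedef struct SharedFPEntry
{
    uint32_t    backend_and_bits;   /* lockbits << 24 | backend number */
    uint32_t    relid;              /* relation OID */
} SharedFPEntry;

/*
 * Pick the way-group for a (backend, relation) pair; each group would
 * hold 16 entries (keeping the 16-way associativity), probed linearly.
 */
static uint32_t
fp_group(uint32_t backend, uint32_t relid, uint32_t ngroups)
{
    uint32_t    h = backend * 0x9E3779B9u ^ relid * 0x85EBCA6Bu;

    return h % ngroups;
}

int
main(void)
{
    SharedFPEntry entry;
    uint32_t    backend = 42;
    uint32_t    relid = 16384;
    uint32_t    lockbits = 0x1;     /* one weak lock mode held */

    entry.backend_and_bits = (lockbits << 24) | backend;
    entry.relid = relid;

    printf("group %u: backend %u holds lockbits %u on rel %u\n",
           fp_group(backend, relid, 1024),
           entry.backend_and_bits & 0xFFFFFF,
           entry.backend_and_bits >> 24,
           entry.relid);
    return 0;
}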

Either way, it feels like structuring the array this way could result
in a large amount of false sharing of cache lines. Current static
allocation means that each process needs to touch only a small set of
cache lines only referenced by itself - quite probable to keep those
lines in CPU local L2 in exclusive mode. In a shared array a larger
number of cache lines are needed and they will be concurrently written
to by other backends - lots of invalidation messages and cache line
bouncing. I don't know how large this effect will be without doing a
prototype and running it on a large machine with high core-to-core
latencies.

It would be possible to create a hybrid approach of a small local FP
array servicing the majority of acquisitions with a larger shared
victim cache for exceptional cases. But it doesn't feel like it is
worth the complexity. At least not without seeing some example
workloads where it would help. And even then, maybe using hierarchical
locking to do less work is the better approach.

Being optimistic, perhaps the current patch was enough to resolve the issue.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

#34Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Ants Aasma (#33)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/22/24 10:50, Ants Aasma wrote:

On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas@vondra.me> wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing. All changes since the version shared on
September 13 are only cosmetic - renaming a macro to keep it consistent
with the other ones, clarifying a couple comments etc. Nothing major.

Great work on this. I have seen multiple customers hitting fast-path
capacity related LockManager contention. They will certainly be glad
to have a fix available when they eventually upgrade. Regretfully I
did not find the time to participate in this discussion during
development. But I did have some thoughts that I wanted to unload to
the list, not as a criticism, but in case it turns out follow up work
is needed.

Driving the array sizing from max_locks_per_transaction seems like a
good idea. The one major difference from the lock table is that while
the lock table is partitioned dynamically between backends, the fast
path array has a static size per backend. One case where that
distinction matters is when only a fraction of backends try to lock
large numbers of relations. This fraction will still fall back to main
lock tables, but at least the contention should be limited by virtue
of not having too many of those backends. The other case is when
max_connections is much higher than the number of backends actually
used. Then backends may be consuming well over
max_locks_per_transaction without running into lock table capacity
issues.

I agree. I don't think the case with a couple lock-hungry backends
matters too much, because as you say there can't be too many of them, so
the contention should not be too bad. At least that was my reasoning.

Regarding the case with very high max_connections values - I doubt we
want to optimize for that very much. Extremely high max_connections
values are a clear anti-pattern (IMO), and if you choose to do that
anyway, you simply have to accept that connections have costs. The
memory for fast-path locking is one of those costs.

I'm not against improving that, ofc, but I think we should only do that
if it doesn't hurt the "reasonable" setups.

In both cases users will have the simple workaround of just increasing
the max_locks_per_transaction setting. Still, I'm sure they would be
happier if things just worked without any tuning. So I tried to figure
out some scheme to get dynamic allocation of fast path locks.

I agree with the premise that less tuning is better. Which is why we
tied this to max_locks_per_transaction.

The best data structure I came up with was to have a shared fast path
lock array. Still partitioned as a 16-way associative cache, but
indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into
the high byte of BackendId thanks to MAX_BACKENDS. Locking could be
handled by one lock per way, or at least on cursory glance it
shouldn't be too difficult to convert the whole fast path acquisition
to be lock free.

Either way, it feels like structuring the array this way could result
in a large amount of false sharing of cache lines. Current static
allocation means that each process needs to touch only a small set of
cache lines only referenced by itself - quite probable to keep those
lines in CPU local L2 in exclusive mode. In a shared array a larger
number of cache lines are needed and they will be concurrently written
to by other backends - lots of invalidation messages and cache line
bouncing. I don't know how large this effect will be without doing a
prototype and running it on a large machine with high core-to-core
latencies.

I don't have a very good intuition regarding cachelines. Ideally, the
backends would access disjoint parts of the array, so there should not
be a lot of false sharing. But maybe I'm wrong, hard to say without an
experimental patch.

It would be possible to create a hybrid approach of a small local FP
array servicing the majority of acquisitions with a larger shared
victim cache for exceptional cases. But it doesn't feel like it is
worth the complexity. At least not without seeing some example
workloads where it would help. And even then, maybe using hierarchical
locking to do less work is the better approach.

Not sure. My intuition would be to keep this as simple as possible.
Having a shared lock table and also a separate fast-path cache is
already sufficiently complex; adding a cache for a cache seems a bit too
much to me.

Being optimistic, perhaps the current patch was enough to resolve the issue.

It's an improvement. But if you want to give the shared fast-path cache
a try, go ahead - if you write a patch, I promise to review it.

regards

--
Tomas Vondra

#35Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#32)
Re: scalability bottlenecks with (many) partitions (and more)

Tomas Vondra <tomas@vondra.me> writes:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

Coverity is not terribly happy with this. "Assert(fpPtr = fpEndPtr);"
is very clearly not doing what you presumably intended. The others
look like overaggressive assertion checking. If you don't want those
macros to assume that the argument is unsigned, you could force the
issue, say with

 #define FAST_PATH_GROUP(index)	\
-	(AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
+	(AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
 	 ((index) / FP_LOCK_SLOTS_PER_GROUP))

________________________________________________________________________________________________________
*** CID 1619664: Incorrect expression (ASSERT_SIDE_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/proc.c: 325 in InitProcGlobal()
319 pg_atomic_init_u32(&(proc->procArrayGroupNext), INVALID_PROC_NUMBER);
320 pg_atomic_init_u32(&(proc->clogGroupNext), INVALID_PROC_NUMBER);
321 pg_atomic_init_u64(&(proc->waitStart), 0);
322 }
323
324 /* Should have consumed exactly the expected amount of fast-path memory. */

CID 1619664: Incorrect expression (ASSERT_SIDE_EFFECT)
Assignment "fpPtr = fpEndPtr" has a side effect. This code will work differently in a non-debug build.

325 Assert(fpPtr = fpEndPtr);
326
327 /*
328 * Save pointers to the blocks of PGPROC structures reserved for auxiliary
329 * processes and prepared transactions.
330 */

________________________________________________________________________________________________________
*** CID 1619662: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3731 in GetLockStatusData()
3725
3726 LWLockAcquire(&proc->fpInfoLock, LW_SHARED);
3727
3728 for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; ++f)
3729 {
3730 LockInstanceData *instance;

CID 1619662: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "f >= 0U".

3731 uint32 lockbits = FAST_PATH_GET_BITS(proc, f);
3732
3733 /* Skip unallocated slots. */
3734 if (!lockbits)
3735 continue;
3736

________________________________________________________________________________________________________
*** CID 1619661: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2696 in FastPathGrantRelationLock()
2690 uint32 group = FAST_PATH_REL_GROUP(relid);
2691
2692 /* Scan for existing entry for this relid, remembering empty slot. */
2693 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2694 {
2695 /* index into the whole per-backend array */

CID 1619661: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".

2696 uint32 f = FAST_PATH_SLOT(group, i);
2697
2698 if (FAST_PATH_GET_BITS(MyProc, f) == 0)
2699 unused_slot = f;
2700 else if (MyProc->fpRelId[f] == relid)
2701 {

________________________________________________________________________________________________________
*** CID 1619660: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2813 in FastPathTransferRelationLocks()
2807
2808 for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
2809 {
2810 uint32 lockmode;
2811
2812 /* index into the whole per-backend array */

CID 1619660: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".

2813 uint32 f = FAST_PATH_SLOT(group, j);
2814
2815 /* Look for an allocated slot matching the given relid. */
2816 if (relid != proc->fpRelId[f] || FAST_PATH_GET_BITS(proc, f) == 0)
2817 continue;
2818

________________________________________________________________________________________________________
*** CID 1619659: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 3067 in GetLockConflicts()
3061
3062 for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
3063 {
3064 uint32 lockmask;
3065
3066 /* index into the whole per-backend array */

CID 1619659: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".

3067 uint32 f = FAST_PATH_SLOT(group, j);
3068
3069 /* Look for an allocated slot matching the given relid. */
3070 if (relid != proc->fpRelId[f])
3071 continue;
3072 lockmask = FAST_PATH_GET_BITS(proc, f);

________________________________________________________________________________________________________
*** CID 1619658: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2739 in FastPathUnGrantRelationLock()
2733 uint32 group = FAST_PATH_REL_GROUP(relid);
2734
2735 FastPathLocalUseCounts[group] = 0;
2736 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2737 {
2738 /* index into the whole per-backend array */

CID 1619658: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".

2739 uint32 f = FAST_PATH_SLOT(group, i);
2740
2741 if (MyProc->fpRelId[f] == relid
2742 && FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
2743 {
2744 Assert(!result);

________________________________________________________________________________________________________
*** CID 1619657: Integer handling issues (NO_EFFECT)
/srv/coverity/git/pgsql-git/postgresql/src/backend/storage/lmgr/lock.c: 2878 in FastPathGetRelationLockEntry()
2872
2873 for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
2874 {
2875 uint32 lockmode;
2876
2877 /* index into the whole per-backend array */

CID 1619657: Integer handling issues (NO_EFFECT)
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "group >= 0U".

2878 uint32 f = FAST_PATH_SLOT(group, i);
2879
2880 /* Look for an allocated slot matching the given relid. */
2881 if (relid != MyProc->fpRelId[f] || FAST_PATH_GET_BITS(MyProc, f) == 0)
2882 continue;
2883

regards, tom lane

#36Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tom Lane (#35)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/22/24 17:45, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

Coverity is not terribly happy with this. "Assert(fpPtr = fpEndPtr);"
is very clearly not doing what you presumably intended. The others
look like overaggressive assertion checking. If you don't want those
macros to assume that the argument is unsigned, you could force the
issue, say with

#define FAST_PATH_GROUP(index)	\
-	(AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
+	(AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
((index) / FP_LOCK_SLOTS_PER_GROUP))

Ah, you're right. I'll fix those asserts tomorrow.

The first is clearly wrong, of course.

For the (x >= 0) asserts, doing it this way relies on negative values
wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
very large value, so it can't accidentally be less than the slot count.

regards

--
Tomas Vondra

#37Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#36)
Re: scalability bottlenecks with (many) partitions (and more)

Tomas Vondra <tomas@vondra.me> writes:

On 9/22/24 17:45, Tom Lane wrote:

#define FAST_PATH_GROUP(index)	\
-	(AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
+	(AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
((index) / FP_LOCK_SLOTS_PER_GROUP))

For the (x >= 0) asserts, doing it this way relies on negative values
wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
very large value, so it can't accidentally be less than the slot count.

Right, any negative value would wrap to something more than
INT32_MAX.
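
For illustration (standalone; the slot count is just a placeholder):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    int         index = -1;
    uint32_t    slots = 1024;   /* stand-in for FP_LOCK_SLOTS_PER_BACKEND */

    /* the cast wraps any negative index to a value of at least 2^31 ... */
    printf("(uint32) -1 = %u\n", (uint32_t) index);

    /* ... so a single "< slots" comparison rejects negative indexes too */
    printf("in range: %s\n", (uint32_t) index < slots ? "yes" : "no");
    return 0;
}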

regards, tom lane

#38Jakub Wartak
Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#30)
Re: scalability bottlenecks with (many) partitions (and more)

On Mon, Sep 16, 2024 at 4:19 PM Tomas Vondra <tomas@vondra.me> wrote:

On 9/16/24 15:11, Jakub Wartak wrote:

On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas@vondra.me> wrote:

[..]

Anyway, at this point I'm quite happy with this improvement. I didn't
have any clear plan when to commit this, but I'm considering doing so
sometime next week, unless someone objects or asks for some additional
benchmarks etc.

Thank you very much for working on this :)

The only concern that comes to my mind is that we could blow up L2
caches. Fun fact: if we are growing PGPROC by ~6.3x, that's going to
be like one or two extra 2MB huge pages at the common max_connections=1000
on x86_64 (830kB -> ~5.1MB), and indeed:

[..]

then maybe(?) one could observe further degradation in the dTLB-miss
perf-stat counters under some microbenchmark, but measuring that
requires isolated physical hardware. Maybe that would actually be noise
from the overhead of the context switches themselves. Just trying to
think out loud about what a big PGPROC could cause here. But this is
already an unhealthy and non-steady state of the system, so IMHO we are
good, unless someone comes up with a better (more evil) idea.

I've been thinking about such cases too, but I don't think it can really
happen in practice, because:

- How likely is it that the sessions will need a lot of OIDs, but not
the same ones? Also, why would it matter that the OIDs are not the same?
I don't think it matters unless one of the sessions needs an exclusive
lock, at which point the optimization doesn't really matter.

- If having more fast-path slots means it doesn't fit into L2 cache,
would we fit into L2 without it? I don't think so - if there really are
that many locks, we'd have to add those into the shared lock table, and
there's a lot of extra stuff to keep in memory (relcaches, ...).

This is pretty much one of the cases I focused on in my benchmarking,
and I'm yet to see any regression.

Sorry for answering this so late. Just for context here: I was
imagining a scenario with high max_connections and e.g. schema-based
multi-tenancy without partitioning (so everything would be fine even
without this $thread/commit; fewer than 16 fast-path locks would be
taken). The OIDs need to be different to avoid contention, so that
futex() does not actually end up in a syscall (just the user-space
part). My theory was that a much smaller PGPROC should be doing far
fewer (data) cache-line fetches than with the patch. The hash() % prime
hits various parts of a larger array - so without the patch it should
be quicker, as it wouldn't be randomly hitting some larger array[], but
it might be noise as you state. It was a theoretical attempt at
crafting the worst possible conditions for the patch, so feel free to
disregard it, as it already assumes an anti-pattern (big and all-active
max_connections).

Well the only thing I could think of was to add to the
doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
is also used as advisory for the number of groups used in
lockmanager's fast-path implementation" (that is, without going into
further discussion, as even pg_locks discussion
doc/src/sgml/system-views.sgml simply uses that term).

Thanks, I'll consider mentioning this in max_locks_per_transaction.
Also, I think there's a place calculating the amount of per-connection
memory, so maybe that needs to be updated too.

I couldn't find it in current versions, but maybe that's helpful/reaffirming:
- up to 9.2 there were exact formulas used, see "(1800 + 270 *
max_locks_per_transaction) * max_connections" [1], but that's long
gone now.
- if anything, Andres might want to improve his blog entry [2] a little
(my take is that it seems to be the most accurate and authoritative
technical information that we have online)
-J.

[1]: https://www.postgresql.org/docs/9.2/kernel-resources.html
[2]: https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/

#39Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tom Lane (#37)
Re: scalability bottlenecks with (many) partitions (and more)

On 9/23/24 01:06, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

On 9/22/24 17:45, Tom Lane wrote:

#define FAST_PATH_GROUP(index)	\
-	(AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
+	(AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
((index) / FP_LOCK_SLOTS_PER_GROUP))

For the (x >= 0) asserts, doing it this way relies on negative values
wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
very large value, so it can't accidentally be less than the slot count.

Right, any negative value would wrap to something more than
INT32_MAX.

Thanks. Pushed a fix for these issues, hopefully coverity will be happy.

BTW is the coverity report accessible somewhere? I know someone
mentioned that in the past, but I don't recall the details. Maybe we
should have a list of all these resources, useful for committers,
somewhere on the wiki?

regards

--
Tomas Vondra

#40Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#39)
Re: scalability bottlenecks with (many) partitions (and more)

Tomas Vondra <tomas@vondra.me> writes:

Thanks. Pushed a fix for these issues, hopefully coverity will be happy.

Thanks.

BTW is the coverity report accessible somewhere? I know someone
mentioned that in the past, but I don't recall the details. Maybe we
should have a list of all these resources, useful for committers,
somewhere on the wiki?

Currently those reports only go to the security team. Perhaps
we should rethink that?

regards, tom lane

#41Matthias van de Meent
Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#21)
4 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On Wed, 4 Sept 2024 at 17:32, Tomas Vondra <tomas@vondra.me> wrote:

On 9/4/24 16:25, Matthias van de Meent wrote:

On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas@vondra.me> wrote:

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the 150B for the LOCK struct).

We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
used[^1]. Does that align with your usage numbers, or are they
significantly larger?

I see more like ~470B per lock. If I patch CalculateShmemSize to log the
shmem allocated, I get this:

max_connections=100 max_locks_per_transaction=1000 => 194264001
max_connections=100 max_locks_per_transaction=2000 => 241756967

and (((241756967-194264001)/100/1000)) = 474

Could be alignment of structs or something, not sure.

NLOCKENTS is calculated based on MaxBackends, which is the sum of
MaxConnections + autovacuum_max_workers + 1 +
max_worker_processes + max_wal_senders; by default that adds
22 more slots.

After adjusting for that, we get 388 bytes/lock, which is
approximately in line with the calculation.
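
For reference, the same arithmetic as a standalone check, using the
shmem sizes you logged above:

#include <stdio.h>

int
main(void)
{
    /* shmem sizes logged for max_locks_per_transaction = 1000 vs. 2000 */
    long    shmem_1000 = 194264001;
    long    shmem_2000 = 241756967;

    /* NLOCKENTS() scales with MaxBackends: 100 connections plus the
     * ~22 default autovacuum/worker/walsender slots */
    int     backends = 100 + 22;
    int     extra_locks = 2000 - 1000;

    /* bytes per lock entry, after accounting for the extra backends */
    printf("bytes per lock entry: %.0f\n",
           (double) (shmem_2000 - shmem_1000) / ((double) backends * extra_locks));
    return 0;
}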

At least based on a quick experiment. (Seems a bit high, right?).

Yeah, that does seem high, thanks for nerd-sniping me.

[...]

Altogether that'd save 40 bytes/lock entry on size, and ~35
bytes/lock on "safety margin", for a saving of (up to) 19% of our
current allocation. I'm not sure if these tricks would benefit
performance or even hurt it, apart from smaller structs usually
fitting better in CPU caches.

Not sure either, but it seems worth exploring. If you do an experimental
patch for the LOCK size reduction, I can get some numbers.

It took me some time to get back to this, and a few hours to
experiment, but here's that experimental patch. Attached are 4 patches,
which together reduce the size of the shared lock tables by about 34%
on my 64-bit system.

1/4 implements the MAX_LOCKMODES changes to LOCK I mentioned before,
saving 16 bytes.
2/4 packs the LOCK struct more tightly, for another 8 bytes saved.
3/4 reduces the PROCLOCK struct size by 8 bytes with a PGPROC* ->
ProcNumber substitution, allowing packing with fields previously
reduced in size in patch 2/4.
4/4 reduces the size of the PROCLOCK table by limiting the average
number of per-backend locks to max_locks_per_transaction (rather than
the current 2*max_locks_per_transaction when getting locks that other
backends also requested), and makes the shared lock tables fully
pre-allocated.

1-3 together save 11% on the lock tables in 64-bit builds, and 4/4
saves another ~25%, for a total of ~34% on per-lock-entry shared memory
usage; from ~360 bytes to ~240 bytes.

Note that this doesn't include the ~4.5 bytes added per PGPROC entry
per mlpxid for fastpath locking; I've ignored those for now.

Not implemented, but technically possible: the PROCLOCK table _could_
be further reduced in size by acknowledging that each of those structs
is always stored after dynahash HASHELEMENTs, which have 4 bytes of
padding on 64-bit systems. By changing PROCLOCKTAG's myProc to
ProcNumber, one could pack that field into the padding of the hash
element header, reducing the effective size of the hash table's
entries by 8 bytes, and thus the total size of the tables by another
few %. I don't think that trade-off is worth it though, given the
complexity and trickery required to get that to work well.

I'm not sure about the safety margins. 10% sure seems like quite a bit
of memory (it might not have in the past, but as the instances are
growing, that probably changed).

I have not yet touched this safety margin.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v0-0002-Reduce-size-of-LOCK-by-8-more-bytes.patch (application/octet-stream)
v0-0001-Reduce-size-of-LOCK-by-16-bytes.patch (application/octet-stream)
v0-0003-Reduce-size-of-PROCLOCK-by-8-bytes-on-64-bit-syst.patch (application/octet-stream)
v0-0004-Reduce-PROCLOCK-hash-table-size.patch (application/octet-stream)
#42Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#32)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 2024-09-21 20:33:49 +0200, Tomas Vondra wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

One minor nit: I don't like that FP_LOCK_SLOTS_PER_BACKEND is now non-constant
while looking like a constant:

#define FP_LOCK_SLOTS_PER_BACKEND (FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

I don't think it's a good idea to have non-function-like #defines that
reference variables that can change from run to run.

Greetings,

Andres Freund

#43Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#42)
Re: scalability bottlenecks with (many) partitions (and more)

On 3/3/25 19:10, Andres Freund wrote:

Hi,

On 2024-09-21 20:33:49 +0200, Tomas Vondra wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

One minor nit: I don't like that FP_LOCK_SLOTS_PER_BACKEND is now non-constant
while looking like a constant:

#define FP_LOCK_SLOTS_PER_BACKEND (FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

I don't think it's a good idea to have non-function-like #defines that
reference variables that can change from run to run.

Fair point, although it can't change "run to run" - not without a
restart. It's not a proper constant, of course, but it seemed close
enough. Yes, it might confuse people into thinking it's a constant, or
is there some additional impact?

The one fix I can think of is making it look more like a function,
possibly just like this:

#define FastPathLockSlotsPerBackend() \
(FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

Or do you have another suggestion?

regards

--
Tomas Vondra

#44Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#43)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 2025-03-03 21:31:42 +0100, Tomas Vondra wrote:

On 3/3/25 19:10, Andres Freund wrote:

On 2024-09-21 20:33:49 +0200, Tomas Vondra wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

One minor nit: I don't like that FP_LOCK_SLOTS_PER_BACKEND is now non-constant
while looking like a constant:

#define FP_LOCK_SLOTS_PER_BACKEND (FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

I don't think it's a good idea to have non-function-like #defines that
reference variables that can change from run to run.

Fair point, although it can't change "run to run" - not without a
restart.

That's what I meant with "run to run".

It's not a proper constant, of course, but it seemed close
enough. Yes, it might confuse people into thinking it's a constant, or
is there some additional impact?

That seems plenty. I just looked at the shmem sizing function and was confused
because I didn't see where max_locks_per_transaction affects the
allocation size.

The one fix I can think of is making it look more like a function,
possibly just like this:

#define FastPathLockSlotsPerBackend() \
(FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

Or do you have another suggestion?

That'd work for me.

Greetings,

Andres Freund

#45Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#44)
1 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On 3/3/25 21:52, Andres Freund wrote:

Hi,

On 2025-03-03 21:31:42 +0100, Tomas Vondra wrote:

On 3/3/25 19:10, Andres Freund wrote:

On 2024-09-21 20:33:49 +0200, Tomas Vondra wrote:

I've finally pushed this, after many rounds of careful testing to ensure
no regressions, and polishing.

One minor nit: I don't like that FP_LOCK_SLOTS_PER_BACKEND is now non-constant
while looking like a constant:

#define FP_LOCK_SLOTS_PER_BACKEND (FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

I don't think it's a good idea to have non-function-like #defines that
reference variables that can change from run to run.

Fair point, although it can't change "run to run" - not without a
restart.

That's what I meant with "run to run".

OK.

It's not a proper constant, of course, but it seemed close
enough. Yes, it might confuse people into thinking it's a constant, or
is there some additional impact?

That seems plenty. I just looked at the shmem sizing function and was confused
because I didn't see where max_locks_per_transaction affects the
allocation size.

But the shmem sizing doesn't use FP_LOCK_SLOTS_PER_BACKEND at all, both
proc.c and postinit.c use the "full" formula, not the macro

FastPathLockGroupsPerBackend * FP_LOCK_SLOTS_PER_GROUP

so why would the macro make this bit less obvious?

The one fix I can think of is making it look more like a function,
possibly just like this:

#define FastPathLockSlotsPerBackend() \
(FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

Or do you have another suggestion?

That'd work for me.

Attached is a patch doing this, but considering it has nothing to do
with the shmem sizing, I wonder if it's worth it.

regards

--
Tomas Vondra

Attachments:

fast-path-macro-fix.patch (text/x-patch)
#46Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#45)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 2025-03-04 14:05:22 +0100, Tomas Vondra wrote:

On 3/3/25 21:52, Andres Freund wrote:

It's not a proper constant, of course, but it seemed close
enough. Yes, it might confuse people into thinking it's a constant, or
is there some additional impact?

That seems plenty. I just looked at the shmem sizing function and was confused
because I didn't see where max_locks_per_transaction affects the
allocation size.

But the shmem sizing doesn't use FP_LOCK_SLOTS_PER_BACKEND at all, both
proc.c and postinit.c use the "full" formula, not the macro

Not sure what I brainfarted there...

The one fix I can think of is making it look more like a function,
possibly just like this:

#define FastPathLockSlotsPerBackend() \
(FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

Or do you have another suggestion?

That'd work for me.

Attached is a patch doing this, but considering it has nothing to do
with the shmem sizing, I wonder if it's worth it.

Yes.

Greetings,

Andres Freund

#47Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#46)
1 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On 3/4/25 14:11, Andres Freund wrote:

Hi,

On 2025-03-04 14:05:22 +0100, Tomas Vondra wrote:

On 3/3/25 21:52, Andres Freund wrote:

It's not a proper constant, of course, but it seemed close
enough. Yes, it might confuse people into thinking it's a constant, or
is there some additional impact?

That seems plenty. I just looked at the shmem sizing function and was confused
because I didn't see where max_locks_per_transaction affects the
allocation size.

But the shmem sizing doesn't use FP_LOCK_SLOTS_PER_BACKEND at all, both
proc.c and postinit.c use the "full" formula, not the macro

Not sure what I brainfarted there...

This got me thinking - maybe it'd be better to use the new
FastPathLockSlotsPerBackend() in all places that need the number of
slots per backend, including those in proc.c etc.? Arguably, these
places should have used FP_LOCK_SLOTS_PER_BACKEND before.

The attached v2 patch does that.

The one fix I can think of is making it look more like a function,
possibly just like this:

#define FastPathLockSlotsPerBackend() \
(FP_LOCK_SLOTS_PER_GROUP * FastPathLockGroupsPerBackend)

Or do you have another suggestion?

That'd work for me.

Attached is a patch doing this, but considering it has nothing to do
with the shmem sizing, I wonder if it's worth it.

Yes.

OK, barring objections I'll push the v2.

regards

--
Tomas Vondra

Attachments:

fast-path-macro-fix-v2.patch (text/x-patch)
#48Tomas Vondra
Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#47)
Re: scalability bottlenecks with (many) partitions (and more)

On 3/4/25 15:38, Tomas Vondra wrote:

...

Attached is a patch doing this, but considering it has nothing to do
with the shmem sizing, I wonder if it's worth it.

Yes.

OK, barring objections I'll push the v2.

Pushed, with the tweaks to use FastPathLockSlotsPerBackend() in a couple
more places.

I noticed sifaka started failing right after I pushed this:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=sifaka&br=master

But I have no idea why this cosmetic change would cause issues with LDAP
tests, so I'm assuming the failure is unrelated, and the timing is
accidental and not caused by the patch.

regards

--
Tomas Vondra

#49Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#48)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 2025-03-04 19:58:38 +0100, Tomas Vondra wrote:

Pushed, with the tweaks to use FastPathLockSlotsPerBackend() in a couple
more places.

Thanks!

I noticed sifaka started failing right after I pushed this:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=sifaka&br=master

But I have no idea why this cosmetic change would cause issues with LDAP
tests, so I'm assuming the failure is unrelated, and the timing is
accidental and not caused by the patch.

The buildfarm was updated between those two runs.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2025-03-04%2015%3A01%3A42
has
'PGBuild::Log' => 'REL_18',
whereas the failing run
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2025-03-04%2017%3A35%3A40
has
'PGBuild::Log' => 'REL_19',

It's worth noting that
a) sifaka doesn't build with ldap support
b) the failure is in checkprep, not when running the tests
c) the buildfarm unfortunately doesn't archive install.log, so it's hard to
know what actually went wrong

Greetings,

Andres Freund

#50Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#49)
Re: scalability bottlenecks with (many) partitions (and more)

Andres Freund <andres@anarazel.de> writes:

On 2025-03-04 19:58:38 +0100, Tomas Vondra wrote:

I noticed sifaka started failing right after I pushed this:

It's worth noting that
a) sifaka doesn't build with ldap support
b) the failure is in checkprep, not when running the tests
c) the buildfarm unfortunately doesn't archive install.log, so it's hard to
know what actually went wrong

Yeah, I've been poking at that. It's not at all clear why the
animal is trying to run src/test/modules/ldap_password_func
now when it didn't before. I've been through the diffs between
BF client 18 and 19 multiple times and nothing jumps out at me.

What's clear though is that it *is* trying to do "make check"
in that directory, and the link step blows up with

ccache clang -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-cast-function-type-strict -g -O2 -fvisibility=hidden -bundle -o ldap_password_func.dylib ldap_password_func.o -L../../../../src/port -L../../../../src/common -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX15.2.sdk -L/opt/local/libexec/llvm-17/lib -L/opt/local/lib -Wl,-dead_strip_dylibs -fvisibility=hidden -bundle_loader ../../../../src/backend/postgres
Undefined symbols for architecture arm64:
"_ldap_password_hook", referenced from:
__PG_init in ldap_password_func.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

That happens because

(a) ldap_password_hook is not defined unless USE_LDAP;

(b) macOS's linker is persnickety and reports the missing symbol
at shlib link time, not shlib load time.

Maybe we should rethink (a)? In the meantime I'm trying to hack
the script so it skips that test module, and finding out that
my Perl is rustier than I thought.

regards, tom lane

#51Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#50)
Re: scalability bottlenecks with (many) partitions (and more)

Hi,

On 2025-03-04 16:30:34 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2025-03-04 19:58:38 +0100, Tomas Vondra wrote:

I noticed sifaka started failing right after I pushed this:

It's worth noting that
a) sifaka doesn't build with ldap support
b) the failure is in checkprep, not when running the tests
c) the buildfarm unfortunately doesn't archive install.log, so it's hard to
know what actually went wrong

Yeah, I've been poking at that. It's not at all clear why the
animal is trying to run src/test/modules/ldap_password_func
now when it didn't before.

It did do so before as well, afaict:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-03-04%2015%3A01%3A42&stg=module-ldap_password_func-check

It seems to me that the difference is that now checkprep is run, whereas
previously it wasn't.

Before:
/Library/Developer/CommandLineTools/usr/bin/make -C adt jsonpath_gram.h
make[3]: `jsonpath_gram.h' is up to date.
echo "# +++ tap check in src/test/modules/ldap_password_func +++" && rm -rf '/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func'/tmp_check && /bin/sh ../../../../config/install-sh -c -d '/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func'/tmp_check && cd . && TESTLOGDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func/tmp_check/log' TESTDATADIR='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func/tmp_check' PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/bin:/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func:$PATH" DYLD_LIBRARY_PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/lib:$DYLD_LIBRARY_PATH" INITDB_TEMPLATE='/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install/initdb-template PGPORT='65678' top_builddir='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func/../../../..' PG_REGRESS='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/ldap_password_func/../../../../src/test/regress/pg_regress' share_contrib_dir='/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/share/postgresql/contrib' /usr/bin/prove -I ../../../../src/test/perl/ -I . --timer t/*.pl
# +++ tap check in src/test/modules/ldap_password_func +++
[10:08:59] t/001_mutated_bindpasswd.pl .. skipped: LDAP not supported by this build
[10:08:59]

Now:
/Library/Developer/CommandLineTools/usr/bin/make -C adt jsonpath_gram.h
make[3]: `jsonpath_gram.h' is up to date.
rm -rf '/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install
/bin/sh ../../../../config/install-sh -c -d '/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install/log
/Library/Developer/CommandLineTools/usr/bin/make -C '../../../..' DESTDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install install >'/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
/Library/Developer/CommandLineTools/usr/bin/make -j1 checkprep >>'/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1
make: *** [temp-install] Error 2
log files for step module-ldap_password_funcCheck:

Note during a normal build ldap_password_func shouldn't be entered:
# Test runs an LDAP server, so only run if ldap is in PG_TEST_EXTRA
ifeq ($(with_ldap),yes)
ifneq (,$(filter ldap,$(PG_TEST_EXTRA)))
SUBDIRS += ldap_password_func
else
ALWAYS_SUBDIRS += ldap_password_func
endif
else
ALWAYS_SUBDIRS += ldap_password_func
endif

Which leads me to suspect that the difference might be related to
NO_TEMP_INSTALL not being set while it previously was. Which then triggers the
module being built, whereas it previously wasn't.

Of course relying on NO_TEMP_INSTALL preventing this from being built isn't
exactly reliable...

Greetings,

Andres Freund

#52Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#51)
Re: scalability bottlenecks with (many) partitions (and more)

Andres Freund <andres@anarazel.de> writes:

On 2025-03-04 16:30:34 -0500, Tom Lane wrote:

Yeah, I've been poking at that. It's not at all clear why the
animal is trying to run src/test/modules/ldap_password_func
now when it didn't before.

It did do so before as well, afaict:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-03-04%2015%3A01%3A42&stg=module-ldap_password_func-check

It seems to me that the difference is that now checkprep is run, whereas
previously it wasn't.

Maybe, but still I don't see any changes in the BF client that'd
explain it. The animal's configuration hasn't changed either;
the only non-comment diff in its buildfarm.conf is

@@ -374,7 +376,7 @@

base_port => 5678,

-       modules => [qw(TestUpgrade TestDecoding)],
+       modules => [qw(TestUpgrade)],

# settings used by run_branches.pl
global => {

which I changed to follow the lead of build-farm.conf.sample.
But surely that wouldn't affect this!?

regards, tom lane

#53Andrew Dunstan
Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#52)
Re: scalability bottlenecks with (many) partitions (and more)

On 2025-03-04 Tu 5:01 PM, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2025-03-04 16:30:34 -0500, Tom Lane wrote:

Yeah, I've been poking at that. It's not at all clear why the
animal is trying to run src/test/modules/ldap_password_func
now when it didn't before.

It did do so before as well, afaict:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-03-04%2015%3A01%3A42&stg=module-ldap_password_func-check
It seems to me that the difference is that now checkprep is run, whereas
previously it wasn't.

Maybe, but still I don't see any changes in the BF client that'd
explain it. The animal's configuration hasn't changed either;
the only non-comment diff in its buildfarm.conf is

@@ -374,7 +376,7 @@

base_port => 5678,

-       modules => [qw(TestUpgrade TestDecoding)],
+       modules => [qw(TestUpgrade)],

# settings used by run_branches.pl
global => {

which I changed to follow the lead of build-farm.conf.sample.
But surely that wouldn't affect this!?

I think I found a logic bug. Testing.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#54Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#53)
1 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

Andrew Dunstan <andrew@dunslane.net> writes:

I think I found a logic bug. Testing.

Not sure what you are looking at, but I was trying to fix it
by making the loop over test modules skip unbuilt modules,
borrowing the test you added in v19 to skip unbuilt contrib
modules. It's a little more complicated for the other modules
because some of them have no .c files to be built, and I could
not get that to work. I eventually concluded that there's
something wrong with the "scalar glob()" idiom you used.
A bit of googling suggested "grep -e, glob()" instead, and
that seems to work for me. sifaka seems happy with the
attached patch.

regards, tom lane

Attachments:

run-build-fix.patch (text/x-diff)
#55Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#54)
Re: scalability bottlenecks with (many) partitions (and more)

Andrew Dunstan <andrew@dunslane.net> writes:

I think I found a logic bug. Testing.

Oh! I bet you are looking at this 18-to-19 diff:

@@ -416,7 +416,8 @@ sub check_install_is_complete
 	{
 		$tmp_loc = "$tmp_loc/$install_dir";
 		$bindir = "$tmp_loc/bin";
-		$libdir = "$tmp_loc/lib/postgresql";
+		$libdir = "$tmp_loc/lib";
+		$libdir .= '/postgresql' unless $libdir =~ /postgres|pgsql/;
 		return (-d $bindir && -d $libdir);
 	}
 	elsif (-e "$build_dir/src/Makefile.global")    # i.e. not msvc
@@ -427,7 +428,8 @@ sub check_install_is_complete
 		chomp $suffix;
 		$tmp_loc = "$tmp_loc/$install_dir";
 		$bindir = "$tmp_loc/bin";
-		$libdir = "$tmp_loc/lib/postgresql";
+		$libdir = "$tmp_loc/lib";
+		$libdir .= '/postgresql' unless $libdir =~ /postgres|pgsql/;
 	}

I'd dismissed that because sifaka isn't running in a directory
that has "postgres" or "pgsql" in its path, but just now I looked
at the logs of one of these steps, and guess where it's installing:

/usr/bin/make -C '../../../..' DESTDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install install >'/Users/buildfarm/bf-data/HEAD/pgsql.build'/tmp_install/log/install.log 2>&1

I bet the "pgsql.build" name is confusing it into doing extra
installs. This'd explain the impression I had that the test steps
were running a bit slower than they ought to. If you check
sifaka's just-posted green run against its history, that run took
13:48 versus recent times of 10:35 or thereabouts, so we're definitely
eating a good deal of time someplace...

regards, tom lane

#56Andrew Dunstan
Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#54)
1 attachment(s)
Re: scalability bottlenecks with (many) partitions (and more)

On 2025-03-04 Tu 5:28 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I think I found a logic bug. Testing.

Not sure what you are looking at, but I was trying to fix it
by making the loop over test modules skip unbuilt modules,
borrowing the test you added in v19 to skip unbuilt contrib
modules. It's a little more complicated for the other modules
because some of them have no .c files to be built, and I could
not get that to work. I eventually concluded that there's
something wrong with the "scalar glob()" idiom you used.
A bit of googling suggested "grep -e, glob()" instead, and
that seems to work for me. sifaka seems happy with the
attached patch.

I'm looking at something else, namely the attached.

Will check your patch out too.

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

bfdirfix.patch (text/x-patch; charset=UTF-8)
#57Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#56)
Re: scalability bottlenecks with (many) partitions (and more)

Andrew Dunstan <andrew@dunslane.net> writes:

Will check your patch out too.

Comparing previous run against current, I now see that my patch
caused it to skip these steps:

module-ldap_password_func-check
module-pg_bsd_indent-check
contrib-sepgsql-check

Skipping the ldap and sepgsql tests is desirable, but it shouldn't
have skipped pg_bsd_indent. I think the cause of that is that
src/tools/pg_bsd_indent isn't built in any of the previous build
steps. Up to now it got built as a side-effect of invoking the
tests, which isn't great because any build errors/warnings disappear
into the install log which the script doesn't capture. I agree
with not capturing the install log, because that's generally
uninteresting once we get past make-install; but we have to be sure
that everything gets built before that.

regards, tom lane

#58Andrew Dunstan
Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#54)
Re: scalability bottlenecks with (many) partitions (and more)

On 2025-03-04 Tu 5:28 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I think I found a logic bug. Testing.

Not sure what you are looking at, but I was trying to fix it
by making the loop over test modules skip unbuilt modules,
borrowing the test you added in v19 to skip unbuilt contrib
modules. It's a little more complicated for the other modules
because some of them have no .c files to be built, and I could
not get that to work. I eventually concluded that there's
something wrong with the "scalar glob()" idiom you used.
A bit of googling suggested "grep -e, glob()" instead, and
that seems to work for me. sifaka seems happy with the
attached patch.

Well, in scalar context it should give us back the first item found, or
undef if nothing is found, AIUI.

But you're right, it might read better if I use a different formulation.

I didn't much like this, though:

+
+        # can't test it if we haven't built it
+        next unless grep -e, glob("$testdir/*.o $testdir/*.obj")
+            or not grep -e, glob("$testdir/*.c");
+

Too many negatives makes my head hurt.

I also note you said in a later email there were issues.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
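
Spelling out that condition may make the double negative easier to follow.
A purely illustrative, de-negated restatement (not taken from either
patch), with the modules directory and the surrounding loop as placeholders:

    use strict;
    use warnings;

    # Placeholder loop over test-module directories.
    foreach my $testdir (glob("src/test/modules/*"))
    {
        # List-context glob() here, with grep counting entries that exist.
        my $has_objects = grep { -e } glob("$testdir/*.o $testdir/*.obj");
        my $has_sources = grep { -e } glob("$testdir/*.c");

        # Equivalent to "next unless <objects built> or not <has C sources>":
        # skip a module only when it has C sources but nothing built yet.
        next if $has_sources && !$has_objects;

        print "would test $testdir\n";
    }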

#59Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#56)
Re: scalability bottlenecks with (many) partitions (and more)

Andrew Dunstan <andrew@dunslane.net> writes:

I'm looking at something else, namely the attached.

Yeah, that avoids the extra installs and brings sifaka's
runtime back to about what it had been.

regards, tom lane

#60Andrew Dunstan
Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#57)
Re: scalability bottlenecks with (many) partitions (and more)

On 2025-03-04 Tu 6:04 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Will check your patch out too.

Comparing previous run against current, I now see that my patch
caused it to skip these steps:

module-ldap_password_func-check
module-pg_bsd_indent-check
contrib-sepgsql-check

Skipping the ldap and sepgsql tests is desirable, but it shouldn't
have skipped pg_bsd_indent. I think the cause of that is that
src/tools/pg_bsd_indent isn't built in any of the previous build
steps. Up to now it got built as a side-effect of invoking the
tests, which isn't great because any build errors/warnings disappear
into the install log which the script doesn't capture. I agree
with not capturing the install log, because that's generally
uninteresting once we get past make-install; but we have to be sure
that everything gets built before that.

Yeah ... I think an easy fix is to put this in make_testmodules():

+
+       # build pg_bsd_indent at the same time
+       # this doesn't really belong here, but it's convenient
+       if (-d "$pgsql/src/tools/pg_bsd_indent" && !$status)
+       {
+               my @indentout = run_log("cd $pgsql/src/tools/pg_bsd_indent && $make_cmd");
+               $status = $? >> 8;
+               push(@makeout,@indentout);
+       }

A lot of this special processing goes away when we're building with meson.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#61Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#58)
Re: scalability bottlenecks with (many) partitions (and more)

Andrew Dunstan <andrew@dunslane.net> writes:

On 2025-03-04 Tu 5:28 PM, Tom Lane wrote:

... I eventually concluded that there's
something wrong with the "scalar glob()" idiom you used.

Well, in scalar context it should give us back the first item found, or
undef if nothing is found, AIUI.

That's what I would have thought too, but it didn't seem to work that
way when I was testing the logic standalone: the script processed or
skipped directories according to no rule that I could figure out.

Anyway, for the moment I think we're all right with just the
directory path fix.

regards, tom lane
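
A postscript on the scalar glob() mystery: one plausible explanation (an
inference from the Perl documentation, not something established above) is
that glob() in scalar context is an iterator bound to its call site, and a
new pattern is not evaluated until the previous list has been exhausted.
A throwaway demonstration with made-up directory names:

    use strict;
    use warnings;
    use File::Temp qw(tempdir);
    use File::Path qw(make_path);

    # Two scratch "module" directories: foo has two .o files, bar has one.
    my $top = tempdir(CLEANUP => 1);
    make_path("$top/foo", "$top/bar");
    for my $f ("$top/foo/a.o", "$top/foo/b.o", "$top/bar/c.o")
    {
        open my $fh, '>', $f or die "cannot create $f: $!";
        close $fh;
    }

    # Reuse one scalar-context glob() call site for both directories, as a
    # test inside a loop over modules would.
    for my $dir ("$top/foo", "$top/bar")
    {
        my $hit = scalar glob("$dir/*.o");
        printf "%s -> %s\n", $dir, defined $hit ? $hit : "<undef>";
    }

    # The iterator still holds a pending match for foo when bar comes
    # around, so the second line typically reports foo/b.o for the bar
    # directory; a "was anything built?" test written this way can pass or
    # fail depending on what the previous directory contained.

That would be consistent with directories getting processed or skipped
according to no obvious rule, while grep over list-context glob() behaves
as expected.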