Init connection time grows quadratically

Started by Потапов Александр, about 1 year ago, 8 messages, hackers

a.potapov@postgrespro.ru

about 1 year ago

Hello!

I ran some experiments with pgbench to measure the initial connection time and found that it grows quadratically with the number of clients. This surprised me, and I would like to understand the reason for this behavior.

Some details on how it was done:

1) I used the branch REL_16_STABLE (commit 2caa85f4).

2) The default system configuration was modified (CPU frequency control, memory, network, RAM disk). Briefly:
sudo cpupower frequency-set -g performance
sudo cpupower idle-set -D0
sudo swapoff -a
sudo sh -c 'echo 16384 >/proc/sys/net/core/somaxconn'
sudo sh -c 'echo 16384 >/proc/sys/net/core/netdev_max_backlog'
sudo sh -c 'echo 16384 >/proc/sys/net/ipv4/tcp_max_syn_backlog'
numactl --membind=0 bash
sudo mount -t tmpfs -o rw,size=512G tmpfs /mnt/ramdisk
exit
Hyperthreading and CPU boost were disabled:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
Please note: when testing on a fast multi-core server with a large number of clients, new connections can be created so quickly that the following error occurs even with these kernel parameters:
pgbench:pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5114" failed: Resource temporarily unavailable

In that case, apply the patch 0001-Fix-fast-connection-rate-issue.patch (attached).

3) The server was configured as:
./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests

4) Build and install on the RAM disk:
make -j$(nproc) -s && make install

5) DB initialization:
/mnt/ramdisk/bin/initdb -k -D /mnt/ramdisk/data -U postgres

Add to postgresql.conf:
huge_pages = off #for the sake of test stability and reproducibility
shared_buffers = 1GB
max_connections = 16384

6) Start command:
a) Start the server (e.g. on the first NUMA socket)
/mnt/ramdisk/bin/pg_ctl -w -D /mnt/ramdisk/data start

b) Create the test database and stop the server
/mnt/ramdisk/bin/psql -U postgres -c 'create database bench'
/mnt/ramdisk/bin/pg_ctl -w -D /mnt/ramdisk/data stop
7) pgbench commands:
Run a single test sequence (I have a dual-socket server, so the server ran on the first socket and the clients on the second):

export PATH=/mnt/ramdisk/bin:$PATH
export NUMACTL_CLIENT="--physcpubind=96-191 --membind=1"
export NUMACTL_SERVER="--physcpubind=0-95 --membind=0"
export CLIENTS=1024

numactl $NUMACTL_SERVER pg_ctl -w -D /mnt/ramdisk/data start
numactl $NUMACTL_CLIENT pgbench -U postgres -i -s100 bench
numactl $NUMACTL_CLIENT psql -U postgres -d bench -c "checkpoint"
numactl $NUMACTL_CLIENT pgbench -U postgres -c$CLIENTS -j$CLIENTS -t100 -S bench
numactl $NUMACTL_SERVER pg_ctl -m smart -w -D /mnt/ramdisk/data stop
8) Measurements & Results
Before the measurements I rebooted the host machine and configured it as described above. I then ran a script that took 30 measurements of the initial connection time for each number of clients and calculated the average and the standard deviation.
The results are presented as a graph and as a table:

---------------------------------------------
| Number of clients | Average init time, ms |
---------------------------------------------
| 1024              | ~435 +-20             |
| 2048              | ~1062 +-20            |
| 4096              | ~3284 +-40            |
| 8192              | ~11617 +-120          |
| 16384             | ~43391 +-230          |
---------------------------------------------

9) The Question
The results fit a quadratic dependence of roughly y ~ 0.0002x^2, where x is the number of clients and y is the init time (ms).
So the question is: is this expected behavior or a bug? What do you think? I would appreciate any comments and opinions.
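
One quick way to sanity-check the quadratic claim against the averages from point 8 is to look at how the time grows each time the number of clients doubles: a factor near 2 would indicate linear growth, a factor near 4 quadratic growth. The C sketch below is only an illustration; the names `doubling_ratio` and `print_doubling_ratios` and the hard-coded arrays are assumptions made for this example, with the averages copied from the measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Averages from the measurements in point 8 (illustrative copy). */
static const int    clients[] = {1024, 2048, 4096, 8192, 16384};
static const double init_ms[] = {435.0, 1062.0, 3284.0, 11617.0, 43391.0};

#define N_POINTS (sizeof(init_ms) / sizeof(init_ms[0]))

/*
 * Growth factor of the init time between measurement i and i + 1,
 * i.e. when the number of clients doubles.
 * ~2 would mean linear growth, ~4 would mean quadratic growth.
 */
double
doubling_ratio(size_t i)
{
    assert(i + 1 < N_POINTS);
    return init_ms[i + 1] / init_ms[i];
}

/* Print one line per doubling step. */
void
print_doubling_ratios(void)
{
    for (size_t i = 0; i + 1 < N_POINTS; i++)
        printf("%5d -> %5d clients: x%.2f\n",
               clients[i], clients[i + 1], doubling_ratio(i));
}
```

With these numbers the factors come out at roughly 2.44, 3.09, 3.54 and 3.74, so they approach 4 as the client count grows. That is consistent with a quadratic term dominating at large N, with a linear component still visible at the small end.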

--
Best regards,
Alexander Potapov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Потапов Александр

a.potapov@postgrespro.ru

about 1 year ago

In reply to: Потапов Александр (#1)

Re: Init connection time grows quadratically

Sorry, I forgot to add the table and graph for point #8
The graph is attached.
This is the table:
---------------------------------------------
| Number of clients | Average init time, ms |
---------------------------------------------
| 1024              | ~435 +-20             |
| 2048              | ~1062 +-20            |
| 4096              | ~3284 +-40            |
| 8192              | ~11617 +-120          |
| 16384             | ~43391 +-230          |
---------------------------------------------

Best regards,
Alexander Potapov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Matthias van de Meent

boekewurm+postgres@gmail.com

about 1 year ago

In reply to: Потапов Александр (#1)

Re: Init connection time grows quadratically

On Tue, 27 May 2025 at 12:45, Потапов Александр
<a.potapov@postgrespro.ru> wrote:

> Hello!
>
> I ran some experiments with pgbench to measure the initial connection time and found that it grows quadratically with the number of clients. This surprised me, and I would like to understand the reason for this behavior.
>
> Some details on how it was done:
>
> [...]
>
> The results fit a quadratic dependence of roughly y ~ 0.0002x^2, where x is the number of clients and y is the init time (ms).
> So the question is: is this expected behavior or a bug? What do you think? I would appreciate any comments and opinions.

Note that the value of "initial connection time" is the time from
roughly the start of the pgbench process until the moment all N
expected connections have been established, *not* the average time it
took pgbench to connect to PostgreSQL. It also does not exclude other
known measurable delays (like spawning threads, synchronization, etc),
so the actual per-connection connection time is probably closer to
O(n) than O(n^2).

Q: Did you check that pgbench or the OS does not have
O(n_active_connections) or O(n_active_threads) overhead per worker
during thread creation or connection establishment, e.g. by varying
the number of threads used to manage these N clients? I wouldn't be
surprised if there are inefficiencies in e.g. the threading- or
synchronization model that cause O(N) per-thread overhead, or O(N^2)
overall when you have one thread per connection.
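
The last point can be made concrete with a toy cost model. Assume, purely for illustration, that opening one more connection costs a fixed amount plus a small amount for every connection that already exists. Neither the model nor the names `connect_cost` and `total_connect_cost` come from pgbench or PostgreSQL. Summing such a linear per-connection cost over N connections gives a total that grows like N^2.

```c
#include <assert.h>

/*
 * Toy model (an assumption, not measured behavior): the cost of opening
 * one more connection while `existing` connections are already open.
 */
double
connect_cost(long existing, double fixed_cost, double per_existing_cost)
{
    assert(existing >= 0);
    return fixed_cost + per_existing_cost * (double) existing;
}

/* Total cost of opening n connections one after another. */
double
total_connect_cost(long n, double fixed_cost, double per_existing_cost)
{
    double total = 0.0;

    for (long i = 0; i < n; i++)
        total += connect_cost(i, fixed_cost, per_existing_cost);
    return total;
}
```

With fixed_cost = 0 and per_existing_cost = 1, total_connect_cost(1024, 0.0, 1.0) is 1024 * 1023 / 2 = 523776 and total_connect_cost(2048, 0.0, 1.0) is 2096128, about four times as much. Doubling N roughly quadruples the total even though each individual connection is only O(N).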

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Потапов Александр

a.potapov@postgrespro.ru

about 1 year ago

In reply to: Matthias van de Meent (#3)

Re: Init connection time grows quadratically

Hi Matthias,

I ran additional experiments (varying the number of threads on the pgbench side and on the server side), but found no change.


Потапов Александр

a.potapov@postgrespro.ru

about 1 year ago

In reply to: Потапов Александр (#4)

Re: Init connection time grows quadratically

To be more precise, I used a constant number of threads (128 and 1024) to compare with the previous results. The quadratic dependency appears in every case; see the new graph.

> Q: Did you check that pgbench or the OS does not have
> O(n_active_connections) or O(n_active_threads) overhead per worker
> during thread creation or connection establishment, e.g. by varying
> the number of threads used to manage these N clients? I wouldn't be
> surprised if there are inefficiencies in e.g. the threading- or
> synchronization model that cause O(N) per-thread overhead, or O(N^2)
> overall when you have one thread per connection.

Maksim.Melnikov

m.melnikov@postgrespro.ru

22 days ago

In reply to: Потапов Александр (#5)

Re: Init connection time grows quadratically

On 6/16/25 11:56, Потапов Александр wrote:

> To be more precise, I used a constant number of threads (128 and 1024) to
> compare with the previous results. The quadratic dependency appears in
> every case; see the new graph.
>
>> Q: Did you check that pgbench or the OS does not have
>> O(n_active_connections) or O(n_active_threads) overhead per worker
>> during thread creation or connection establishment, e.g. by varying
>> the number of threads used to manage these N clients? I wouldn't be
>> surprised if there are inefficiencies in e.g. the threading- or
>> synchronization model that cause O(N) per-thread overhead, or O(N^2)
>> overall when you have one thread per connection.

Hi, all!

I've investigated a slightly different scenario than Alexander, and I
want to share my thoughts in this thread too.

I found that when we run pgbench scenarios sequentially, without
restarting postgres between iterations, the initial connection time
(ICT) degrades from launch to launch and eventually stabilizes at
values worse than in the first run (ICT_degradation.png attached).

Scenario details:

1. postgres rev 1a51ec16db7, with 0001-Fix-fast-connection-rate-issue.patch
applied

2. NUMA was disabled, swap was off, a ramdisk was used for the binaries
and pg_data, hyperthreading and CPU boost were disabled

3. The server was built with -O2

4. Add to postgresql.conf:
huge_pages = off #for the sake of test stability and reproducibility
shared_buffers = 1GB
max_connections = 16384

5. Cycle iteration:

psql -U postgres -c 'create database bench'
pgbench -U postgres -i -s100 bench
psql -U postgres -d bench -c "checkpoint"
pgbench -U postgres -c${clients} -t100 -j6 -S bench
psql -U postgres -c "DROP DATABASE bench"

where ${clients} is one of 512/1024/2048/4096/8192.

I noticed that ICT for the first iteration is much better than for the
following ones. I investigated this behavior a bit and found a lot of
minor page fault events in ProcArrayAdd (perf_without_patch-j6.txt
attached) on the line

allProcs[procno].pgxactoff = index;

So every proc.pgxactoff access generates a page fault, because the proc
objects are accessed at scattered memory locations and page replacement
can occur. I have an idea how to improve this: it seems we can put the
pgxactoff array separately in shmem, so that only a few hot pages are
needed for it. I've attached the corresponding patch
(0001-This-patch-reduce-connection-init-close-time.patch). The perf
output with minor faults for the patched version is also attached
(perf_with_patch-j6.txt).

As we can see, the patched version fixes this. I made a series of
measurements for all versions and attached a comparison chart
(ICT_degradation_with_patch.png). Here is the table with the results:

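Average ICT, ms:
-------------------------------------------------------------------------------
| Clients | Without patch,  | Without patch, | With patch,     | With patch,  |
|         | first iteration | after warmup   | first iteration | after warmup |
-------------------------------------------------------------------------------
| 512     | ~480 +-15       | ~500 +-10      | ~470 +-15       | ~215 +-15    |
| 1024    | ~920 +-20       | ~1000 +-20     | ~900 +-15       | ~920 +-20    |
| 2048    | ~1780 +-30      | ~2240 +-20     | ~1760 +-30      | ~1800 +-30   |
| 4096    | ~3570 +-50      | ~6140 +-40     | ~3450 +-130     | ~3740 +-60   |
| 8192    | ~7440 +-90      | ~18840 +-140   | ~7440 +-140     | ~8100 +-80   |
-------------------------------------------------------------------------------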

As the attached charts show, on the one hand there is no degradation in
the first iterations for either the patched or the unpatched version.
On the other hand, the patched version is much better in the following
iterations.
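
The layout effect can be illustrated without PostgreSQL at all. The sketch below counts how many distinct memory pages are touched when one int-sized field is written for each of N entries, first when the field lives inside a large per-process struct and then when it lives in a separate dense array. The 832-byte struct size, the 4096-byte page size and the function name `pages_touched` are assumptions chosen for the example; the real sizeof(PGPROC) depends on the PostgreSQL version and build.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of distinct pages touched when writing one field of `field_size`
 * bytes in each of `n` consecutive elements that are `stride` bytes apart,
 * starting at a page-aligned base and with the field at offset 0.
 */
size_t
pages_touched(size_t n, size_t stride, size_t field_size, size_t page_size)
{
    size_t pages = 0;
    size_t next_new_page = 0;   /* lowest page index not yet counted */

    assert(field_size > 0 && stride >= field_size && page_size > 0);

    for (size_t i = 0; i < n; i++)
    {
        size_t first = (i * stride) / page_size;
        size_t last = (i * stride + field_size - 1) / page_size;

        /* Offsets only grow, so a page is new iff it was not counted yet. */
        for (size_t p = first; p <= last; p++)
        {
            if (p >= next_new_page)
            {
                pages++;
                next_new_page = p + 1;
            }
        }
    }
    return pages;
}
```

For 8192 entries, pages_touched(8192, 832, 4, 4096) gives 1664 pages, while the dense layout pages_touched(8192, 4, 4, 4096) gives 8. Under these assumptions a process that walks all entries has about 200 times fewer 4 kB pages to fault in when the offsets are stored densely, which would be consistent with the minor page faults seen in the perf output.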

I've tried the same tests with different values of the pgbench -j
parameter, to rule out effects of pgbench's multithreading, and got the
same picture.

I hope this is interesting and helpful.

Best regards,

Maksim Melnikov

Matthias van de Meent

boekewurm+postgres@gmail.com

22 days ago

In reply to: Maksim.Melnikov (#6)

Re: Init connection time grows quadratically

On Wed, 3 Jun 2026 at 08:33, Maksim.Melnikov <m.melnikov@postgrespro.ru> wrote:

> Hi, all!
>
> I've investigated a slightly different scenario than Alexander, and I want to share my thoughts in this thread too.
>
> I found that when we run pgbench scenarios sequentially, without restarting postgres between iterations, the initial connection time (ICT) degrades from launch to launch and eventually stabilizes at values worse than in the first run (ICT_degradation.png attached).
>
> Scenario details:
>
> [...]
>
> 4. Add to postgresql.conf:
> huge_pages = off #for the sake of test stability and reproducibility

I think this is the main culprit of the extreme slowdown -- without
huge pages, you're effectively guaranteed to get many minor page
faults, and with them the corresponding TLB misses. With huge pages
enabled, the proc array should fit on one (or just a few) memory
pages.

We're not generally in the business of optimizing workloads that have
huge_pages=off.

> I noticed that ICT for the first iteration is much better than for the following ones. I investigated this behavior a bit and found a lot of minor page fault events in ProcArrayAdd (perf_without_patch-j6.txt attached) on the line
>
> allProcs[procno].pgxactoff = index;
>
> So every proc.pgxactoff access generates a page fault, because the proc objects are accessed at scattered memory locations and page replacement can occur. I have an idea how to improve this: it seems we can put the pgxactoff array separately

> page replacement can occur

I doubt that this is an issue. Page tables are not removed until the
mapping is removed, and it is highly unlikely that hot shmem areas
(like the PGPROC array) are ever swapped out. It's just that with
smaller memory pages the OS will have to create more page mappings for
the same amount of shared memory, and that'll take more resources
(cpu, memory, time) than it would with large (or huge) memory pages.

> in shmem, so that only a few hot pages are needed for it. I've attached the corresponding patch (0001-This-patch-reduce-connection-init-close-time.patch). The perf output with minor faults for the patched version is also attached (perf_with_patch-j6.txt).

I see. Despite your argument hinging on small pages, I think there is
still some benefit to using a dense array instead of PGPROC.pgxactoff:
With a dense array, ProcArrayAdd/ProcArrayRemove need to touch fewer
cache lines, which are also less likely to be recently dirtied by
unrelated shared proc updates.

However, I'm now a bit more concerned about the number of indirections
required for other operations. Previously, pgxactoff was read at a
fixed offset from the PGPROC pointer, but with this patch getting its
value is a bit more involved.

> As we can see, the patched version fixes this. I made a series of measurements for all versions and attached a comparison chart (ICT_degradation_with_patch.png). Here is the table with the results:

Do you happen to have data with huge_pages enabled?

> I hope this is interesting and helpful.

Definitely interesting. I'm not so sure it's as effective on a
production configuration (with huge pages enabled), but I'm definitely
interested in seeing test results.

----

Some comments on the patch:

> +++ b/src/backend/storage/lmgr/proc.c
> +    size = add_size(size, mul_size(TotalProcs, sizeof(int)));

Let's use the following, to fit the surrounding pattern:

+    size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->pgxactoffs)));

> @@ -273,7 +274,10 @@ ProcGlobalShmemInit(void *arg)
>      ProcGlobal->statusFlags = (uint8 *) ptr;
>      ptr = ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
> -    /* make sure we didn't overflow */
> +    ProcGlobal->pgxactoffs = (int *) ptr;
> +    ptr = (char *) ptr + TotalProcs * sizeof(int);
> +
> +    /* make sure wer didn't overflow */
>      Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));

This needs to be updated, because right now it fails to account for
alignment when (TotalProcs * sizeof(statusFlags)) is not a multiple of
sizeof(int). The other fields take care to be correctly aligned, but
your code doesn't do that yet. It's probably best to allocate and
assign *pgxactoffs just ahead of statusFlags.
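
To illustrate the alignment concern: when several arrays are carved out of one shared memory block, the running offset has to be rounded up before an array of a wider type is placed after an array of bytes. The sketch below uses a hypothetical helper `align_up` and plain offsets instead of PostgreSQL's own alignment macros and shared memory API, and it assumes that the alignment of int equals sizeof(int), which holds on common platforms. It only shows the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment` (a power of two). */
size_t
align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/*
 * Offset at which an int array can safely start when it follows
 * `total_procs` one-byte status flags that begin at `flags_offset`.
 * Assumption: int must be aligned to sizeof(int).
 */
size_t
int_array_offset_after_flags(size_t flags_offset, size_t total_procs)
{
    size_t end_of_flags = flags_offset + total_procs * sizeof(unsigned char);

    return align_up(end_of_flags, sizeof(int));
}
```

For example, with 1001 status flags starting at offset 0 and a 4-byte int, the int array has to start at offset 1004 rather than 1001. Placing the int array ahead of the byte array, as suggested above, avoids the padding step altogether.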

> +    /* make sure wer didn't overflow */

New typo introduced.

> +++ b/src/include/storage/proc.h
> +#define GetXactOffPGProc(proc) (ProcGlobal->pgxactoffs[(proc) - &ProcGlobal->allProcs[0]])
> +#define GetMyXactOffPGProc() (GetXactOffPGProc(MyProc))

I'd replace this with

+#define ProcGetXactOff(procno) (ProcGlobal->pgxactoffs[(procno)])
+#define ProcGetMyXactOff() (ProcGetXactOff(MyProcNo))

So that callers can use GetNumberFromPGProc() manually if they need
to, but the offset-based calculations of Proc-to-Number are avoided
when that is possible.
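
The difference between the two macro styles can be shown with stand-in types. In this sketch `FakeProc`, `FakeProcGlobal`, `xactoff_by_pointer` and `xactoff_by_number` are hypothetical names, not PostgreSQL definitions, and the 832-byte payload is an arbitrary assumption. The pointer-based lookup first has to derive the array index from the struct address, while the number-based lookup indexes the dense array directly.

```c
#include <assert.h>
#include <stddef.h>

typedef struct FakeProc
{
    char payload[832];          /* stand-in for the rest of the struct */
} FakeProc;

typedef struct FakeProcGlobal
{
    FakeProc *allProcs;         /* array of per-process structs */
    int      *pgxactoffs;       /* dense array, one entry per process */
} FakeProcGlobal;

/* Lookup via struct pointer: needs a pointer subtraction first. */
int
xactoff_by_pointer(const FakeProcGlobal *g, const FakeProc *proc)
{
    ptrdiff_t procno = proc - &g->allProcs[0];

    assert(procno >= 0);
    return g->pgxactoffs[procno];
}

/* Lookup via process number: a plain array index. */
int
xactoff_by_number(const FakeProcGlobal *g, int procno)
{
    assert(procno >= 0);
    return g->pgxactoffs[procno];
}
```

Both functions return the same value for the same process. The second form skips the address-to-index conversion whenever the caller already knows the process number.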

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

Maksim.Melnikov

m.melnikov@postgrespro.ru

14 days ago

In reply to: Matthias van de Meent (#7)

Re: Init connection time grows quadratically

On 6/3/26 16:35, Matthias van de Meent wrote:

>> 4. Add to postgresql.conf:
>> huge_pages = off #for the sake of test stability and reproducibility
>
> I think this is the main culprit of the extreme slowdown -- without
> huge pages, you're effectively guaranteed to get many minor page
> faults, and with them the corresponding TLB misses. With huge pages
> enabled, the proc array should fit on one (or just a few) memory
> pages.
>
> We're not generally in the business of optimizing workloads that have
> huge_pages=off.

Yes, I agree, huge_pages=off is not a common setup now. My motivation
was that even if a configuration isn't commonly used, that doesn't mean
it isn't interesting to someone, and it can be optimized without
degrading the basic scenarios. Moreover, huge_pages = try is the
default value; with that setting the server tries to request huge
pages but falls back to the huge_pages=off behavior if that fails. As
far as I know, the Linux default is vm.nr_hugepages = 0, which means
that by default the OS does not provide HugeTLB pages. Of course, a DBA
should set this up, but in practice they can miss it. Anyway, if the
community isn't interested in this kind of optimization, that is OK. It
was an interesting and educational investigation for me; thanks for
your help.

.....

>> As we can see, the patched version fixes this. I made a series of measurements for all versions and attached a comparison chart (ICT_degradation_with_patch.png). Here is the table with the results:
>
> Do you happen to have data with huge_pages enabled?
>
>> I hope this is interesting and helpful.
>
> Definitely interesting. I'm not so sure it's as effective on a
> production configuration (with huge pages enabled), but I'm definitely
> interested in seeing test results.

I've made comparative measurements for configurations with huge_pages =
on and off. Please see the results below.

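----------------------------------------------------------------------------------------
| Clients | Huge pages off,  | Huge pages off,   | Huge pages on,   | Huge pages on,   |
|         | with patch       | without patch     | with patch       | without patch    |
----------------------------------------------------------------------------------------
| 512     | ~480 +- 3.5% ms  | ~490 +- 3% ms     | ~420 +- 3.5% ms  | ~420 +- 3.5% ms  |
| 1024    | ~910 +- 1.3% ms  | ~990 +- 2% ms     | ~790 +- 1.7% ms  | ~800 +- 1.8% ms  |
| 2048    | ~1810 +- 1.4% ms | ~2230 +- 0.9% ms  | ~1540 +- 0.7% ms | ~1530 +- 1.4% ms |
| 4096    | ~3690 +- 1.9% ms | ~6060 +- 0.8% ms  | ~3070 +- 0.6% ms | ~3070 +- 0.9% ms |
| 8192    | ~9900 +- 0.6% ms | ~18530 +- 0.4% ms | ~6220 +- 0.7% ms | ~6230 +- 0.7% ms |
----------------------------------------------------------------------------------------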

A comparison chart is also attached.

As we can see, the measurements confirm the patch's benefit for the
configuration with huge_pages=off (the same result as in my previous
message), but with huge_pages=on I got the same results for both
versions: no improvement and no degradation.

> ----
>
> Some comments on the patch:

A patch with the fixes is attached. Thanks for the review.

Best regards,

Maksim Melnikov
