Improving connection scalability (src/backend/storage/ipc/procarray.c)
Hi hackers,
Inspired by Andres's work, I put effort into trying to improve
GetSnapshotData().
I think I have got something.
As is well known, GetSnapshotData() is a critical function for Postgres
performance, as other benchmarks have already demonstrated.
So any gain, no matter how small, makes a difference.
So far, no regression has been observed.
Windows 10 64 bits (msvc 2019 64 bits)
pgbench -M prepared -c $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3027.550711
10 12262.415696 12876.641772
50 13656.724571 14877.410140
80 14338.202348 15244.192915
pgbench can't run with conns > 80.
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I was surprised that in my tests, with more than 100 connections,
the tps performance drops, even on head.
I don't have access to a powerful machine to really test with a high
workload.
But with connections below 100, it seems to me that there are obvious gains.
Patch attached.
regards,
Ranier Vilela
Attachments:
001-improve-scability-procarray.patch (application/octet-stream, +157 -110)
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument. I'd guess
that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock. That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
* This very likely can be evolved to not need ProcArrayLock held (at very
* least in the case we already hold a snapshot), but that's for another day.
However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, May 24, 2022 at 1:06 PM Robert Haas <robertmhaas@gmail.com>
wrote:
On Tue, May 24, 2022 at 11:28 AM Ranier Vilela <ranier.vf@gmail.com>
wrote:
I think that I got something.
You might have something, but it's pretty hard to tell based on
looking at this patch. Whatever relevant changes it has are mixed with
a bunch of changes that are probably not relevant. For example, it's
hard to believe that moving "uint32 i" to an inner scope in
XidInMVCCSnapshot() is causing a performance gain, because an
optimizing compiler should figure that out anyway.
I believe that even these small changes are helpful and worthwhile.
They improve code readability and help the compiler generate better code,
especially for older compilers.
An even bigger issue is that it's not sufficient to just demonstrate
that the patch improves performance. It's also necessary to make an
argument as to why it is safe and correct, and "I tried it out and
nothing seemed to break" does not qualify as an argument.
OK; certainly my argument has not been convincing so far.
I'd guess that most or maybe all of the performance gain that you've observed
here is attributable to changing GetSnapshotData() to call
GetSnapshotDataReuse() without first acquiring ProcArrayLock.
It certainly helps, but I believe that's not the only reason: in all the
tests I did there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even new ones, which
is unnecessary.
Another example: NormalTransactionIdPrecedes is more expensive than
testing statusFlags.
That
doesn't seem like a completely hopeless idea, because the comments for
GetSnapshotDataReuse() say this:
* This very likely can be evolved to not need ProcArrayLock held (at very
* least in the case we already hold a snapshot), but that's for another day.
However, those comments seem to imply that it might not be safe in all
cases, and that changes might be needed someplace in order to make it
safe, but you haven't updated these comments, or changed the function
in any way, so it's not really clear how or whether whatever problems
Andres was worried about have been handled.
I think it's worth trying and testing to see if everything goes well,
and then applying whatever comment updates are needed in the final patch.
regards,
Ranier Vilela
Hi,
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
What pgbench scale is this and are you using an optimized build?
Greetings,
Andres Freund
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
It certainly helps, but I trust that's not the only reason, in all the
tests I did, there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even the new ones, which
is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated. Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the common
case.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot, but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
Another example NormalTransactionIdPrecedes is more expensive than testing
statusFlags.
That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
Greetings,
Andres Freund
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de>
wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hints.
Yes, the 17-18k tps is disappointing.
What pgbench scale is this and are you using an optimized build?
Yes, this is an optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation -Wno-stringop-truncation -O2'
from config.log
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Can you share the pgbench configuration and shared_buffers for
this benchmark?
/messages/by-id/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
regards,
Ranier Vilela
On Wed, May 25, 2022 at 12:56 AM Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-24 13:23:43 -0300, Ranier Vilela wrote:
It certainly helps, but I trust that's not the only reason, in all the
tests I did, there was an improvement in performance, even before using
this feature.
If you look closely at GetSnapshotData() you will see that
GetSnapshotDataReuse is called for all snapshots, even new ones, which
is unnecessary.
That only happens a handful of times as snapshots are persistently
allocated.
Yes, but with the patch this no longer happens for new snapshots.
Doing an extra GetSnapshotDataReuse() in those cases doesn't matter
for performance. If anything this increases the number of jumps for the
common case.
IMHO, with GetSnapshotData() any gain makes a difference.
It'd be a huge win to avoid needing ProcArrayLock when reusing a snapshot,
but
it's not at all easy to guarantee that it's correct / see how to make it
correct. I'm fairly sure it can be made correct, but ...
I believe it's worth the effort to make sure everything goes well and use
this feature.
Another example: NormalTransactionIdPrecedes is more expensive than
testing statusFlags.
That may be true when you count instructions, but isn't at all true when you
take into account that the cachelines containing status flags are hotly
contended.
Also, the likelihood of filtering out a proc due to
NormalTransactionIdPrecedes(xid, xmax) is *vastly* higher than due to the
statusFlags check. There may be a lot of procs failing that test, but
typically there will be far fewer backends in vacuum or logical decoding.
I believe that keeping the instructions in the cache together works better
than having the status flags test in the middle.
But I will test this to be sure.
regards,
Ranier Vilela
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hints.
Yes, the 17-18k tps is disappointing.
What pgbench scale is this and are you using an optimized build?
Yes, this is an optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be an assert-enabled build. We need to see the configure flags.
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is the default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 25, 2022 at 7:13 AM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hints.
Yes, the 17-18k tps is disappointing.
What pgbench scale is this and are you using an optimized build?
Yes, this is an optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be an assert-enabled build. We need to see the configure flags.
./configure
Attached the config.log (compressed)
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
The shared_buffers is the default:
shared_buffers = 128MB
Intel® Core™ i5-8250U CPU Quad Core
RAM 8GB
SSD 256 GB
Well, quick results on my laptop (i7-9750H, so not that different from
what you have):
1 = 18908.080126
2 = 32943.953182
3 = 42316.079028
4 = 46700.087645
So something is likely wrong in your setup.
select version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 15beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.4.0-1ubuntu1~20.04.1'
--with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2
--prefix=/usr --with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie
--with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto
--enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
regards,
Ranier Vilela
On Wed, May 25, 2022 at 8:26 AM Ranier Vilela <ranier.vf@gmail.com>
wrote:
On Wed, May 25, 2022 at 7:13 AM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/25/22 11:07, Ranier Vilela wrote:
On Wed, May 25, 2022 at 12:46 AM Andres Freund <andres@anarazel.de> wrote:
Hi Andres, thank you for taking a look.
On 2022-05-24 12:28:20 -0300, Ranier Vilela wrote:
Linux Ubuntu 64 bits (gcc 9.4)
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
conns tps head tps patched
1 2918.004085 3190.810466
10 12262.415696 17199.862401
50 13656.724571 18278.194114
80 14338.202348 17955.336101
90 16597.510373 18269.660184
100 17706.775793 18349.650150
200 16877.067441 17881.250615
300 16942.260775 17181.441752
400 16794.514911 17124.533892
500 16598.502151 17181.244953
600 16717.935001 16961.130742
700 16651.204834 16959.172005
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
17-18k tps is pretty low for pgbench -S. For a shared_buffers resident run, I
can get 40k in a single connection in an optimized build. If you're testing a
workload >> shared_buffers, GetSnapshotData() isn't the bottleneck. And
testing an assert build isn't a meaningful exercise either, unless you have
way way higher gains (i.e. stuff like turning O(n^2) into O(n)).
Thanks for sharing these hints.
Yes, the 17-18k tps is disappointing.
What pgbench scale is this and are you using an optimized build?
Yes, this is an optimized build.
CFLAGS='-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2'
from config.log
That can still be an assert-enabled build. We need to see the configure flags.
./configure
Attached the config.log (compressed)
pgbench was initialized with:
pgbench -i -p 5432 -d postgres
pgbench -M prepared -c 100 -j 100 -S -n -U postgres
You're not specifying duration/number of transactions to execute. So
it's using just 10 transactions per client, which is bound to give you
bogus results due to not having anything in relcache etc. Use -T 60 or
something like that.
Ok, I will try with -T 60.
Here the results with -T 60:
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312
regards,
Ranier Vilela
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I think we've agreed these results are useless.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312
I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 26, 2022 at 10:30 PM Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but my expectations have diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in the tests.
Performance degrades with a greater number of connections.
Isn't GetSnapshotData() a bottleneck then?
Linux Ubuntu 64 bits
shared_buffers = 128MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations.
However, I think they should be treated as micro-optimizations in inner
loops, because everything in procarray.c is a hot path.
The first objective, I believe, was achieved: no performance regression.
I agree that the gains are small in the tests done.
But, IMHO, this is a good path; small gains turn into big gains in the end,
when applied across all the code.
Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true, and GetSnapshotDataReuse is not called
for new snapshots.
count, subcount and suboverflowed are not initialized for all snapshots.
2. If the snapshot is taken during recovery,
the pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily,
only if it is not suboverflowed.
Skipping all InvalidTransactionId, mypgxactoff, backends doing logical
decoding, and XIDs >= xmax.
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's agreement that this would be fine, for now.
Consider ComputeXidHorizons():
1. ProcGlobal->statusFlags is touched before taking the lock.
2. allStatusFlags[index] is not touched for all numProcs.
All changes were made with the aim of avoiding or postponing unnecessary
work.
BTW it's generally a good idea to do multiple runs and then use the
average and/or median. Results from a single run may be quite noisy.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
number of transactions per client: 10
conns tps head tps patched
1 2918.004085 3211.303789
10 12262.415696 15540.015540
50 13656.724571 16701.182444
80 14338.202348 16628.559551
90 16597.510373 16835.016835
100 17706.775793 16607.433487
200 16877.067441 16426.969799
300 16942.260775 16319.780662
400 16794.514911 16155.023607
500 16598.502151 16051.106724
600 16717.935001 16007.171213
700 16651.204834 16004.353184
800 16467.546583 16834.591719
900 16588.241149 16693.902459
1000 16564.985265 16936.952195
I think we've agreed these results are useless.
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17792.414234
10 82365.634750 82468.334836
50 74593.714180 74678.839428
80 69219.756038 73116.553986
90 67419.574189 68384.906949
100 66613.771701 66997.793777
200 61739.784830 62870.243385
300 62109.691298 62796.157548
400 61630.822446 63129.555135
500 61711.019964 62755.190389
600 60620.010181 61517.913967
700 60303.317736 61688.044232
800 60451.113573 61076.666572
900 60017.327157 61256.290037
1000 60088.823434 60986.799312
I have no idea why shared buffers 2GB would be interesting. The proposed
change was related to procarray, not shared buffers. And scale 1 is
~15MB of data, so it fits into 128MB just fine.
I thought about doing this benchmark in the most common usage situation
(shared_buffers = 25% of RAM).
Also, the first ~10 results for "patched" case match results for 128MB
shared buffers. That seems very unlikely to happen by chance, so this
seems rather suspicious.
Probably a copy-and-paste mistake.
I redid this test for the patched build:
Linux Ubuntu 64 bits
shared_buffers = 2048MB
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17174.265804 17524.482668
10 82365.634750 81840.537713
50 74593.714180 74806.729434
80 69219.756038 73116.553986
90 67419.574189 69130.749209
100 66613.771701 67478.234595
200 61739.784830 63094.202413
300 62109.691298 62984.501251
400 61630.822446 63243.232816
500 61711.019964 62827.977636
600 60620.010181 62838.051693
700 60303.317736 61594.629618
800 60451.113573 61208.629058
900 60017.327157 61171.001256
1000 60088.823434 60558.067810
regards,
Ranier Vilela
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would
expect? Restricting both client and server to the same four cores on a
thermally challenged older laptop I have around, I get 150k tps at both 10
and 100 clients.
Either way, I'd not expect to see any GetSnapshotData() scalability effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not enough
concurrency.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Greetings,
Andres Freund
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will always
work - because there are no writes - and thus the only work that needs to be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a problem.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they should be
considered micro-optimizations in inner loops, because everything in
procarray.c is a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree, the gains are small, by the tests done.
There are no gains.
But, IMHO, this is a good way, small gains turn into big gains in the end,
when applied to all code.
Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true and GetSnapshotDataReuse is not called
for new
snapshots.
count, subcount, and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There are only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
2. If the snapshot is taken during recovery:
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Greetings,
Andres Freund
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one would
expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores on a
thermally challenged older laptop I have around, I get 150k tps at both 10
and 100 clients.
Can you share the benchmark details? Hardware, postgres, and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just not
enough
concurrency.
This means that our customers will not see any connection scalability with
PG15, using the simplest hardware?
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come up with a
workload that shows it being a meaningful part of a profile. Unless it is,
performance differences are going to just be due to various forms of noise.
Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.
regards,
Ranier Vilela
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de>
wrote:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more connections
won't improve performance. It'll just add more and more context switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes, it is quite disappointing that with 100 connections, tps loses to 10
connections.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic will
always
work - because there are no writes - and thus the only work that needs to
be
done is to acquire the ProcArrayLock briefly. And because there is only a
small number of cores, contention on the cacheline for that isn't a
problem.
Thanks for sharing this.
These results look much saner, but IMHO it also does not show any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they
should be considered micro-optimizations in inner loops, because
everything in procarray.c is a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
IMHO, I must disagree.
But, IMHO, this is a good way, small gains turn into big gains in the
end,
when applied to all code.
Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true and GetSnapshotDataReuse is not called
for new
snapshots.
count, subcount, and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There are only a few "new" snapshots in the life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapshotData() is the bottleneck, everything is relevant.
2. If the snapshot is taken during recovery:
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is reached *only* when in recovery.
3. Calling GetSnapshotDataReuse() without first acquiring ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
IMHO, anything you move out from under the lock is a benefit.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?
regards,
Ranier Vilela
On 5/28/22 02:15, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de
<mailto:andres@anarazel.de>> wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any
clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one
would
expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores on a
thermally challenged older laptop I have around, I get 150k tps at
both 10
and 100 clients.
Can you share the benchmark details? Hardware, postgres, and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
not enough
concurrency.
This means that our customers will not see any connection scalability
with PG15, using the simplest hardware?
No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such a tiny system too much.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come
up with a
workload that shows it being a meaningful part of a profile. Unless
it is,
performance differences are going to just be due to various forms of
noise.
Actually, in the profiles I got with perf, GetSnapshotData() didn't show up.
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/28/22 02:36, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:22, Andres Freund <andres@anarazel.de
<mailto:andres@anarazel.de>> wrote:
Hi,
On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
On Thu, May 26, 2022 at 22:30, Tomas Vondra <
tomas.vondra@enterprisedb.com<mailto:tomas.vondra@enterprisedb.com>> wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
...
Here the results with -T 60:
Might be a good idea to share your analysis / interpretation of the
results, not just the raw data. After all, the change is being proposed
by you, so do you think this shows the change is beneficial?
I think so, but the expectation has diminished.
I expected that the more connections, the better the performance.
And for both patch and head, this doesn't happen in tests.
Performance degrades with a greater number of connections.
Your system has four CPUs. Once they're all busy, adding more
connections
won't improve performance. It'll just add more and more context
switching,
cache misses, and make the OS scheduler do more work.
conns tps head
10 82365.634750
50 74593.714180
80 69219.756038
90 67419.574189
100 66613.771701
Yes, it is quite disappointing that with 100 connections, tps loses to 10
connections.
IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.
This behavior is natural/inherent, it's unlikely to go away, and it's
one of the reasons why we recommend not to use too many connections. If
you try to maximize throughput, just don't do that. Or just use machine
with more cores.
GetSnapshotData() isn't a bottleneck?
I'd be surprised if it showed up in a profile on your machine with that
workload in any sort of meaningful way. The snapshot reuse logic
will always
work - because there are no writes - and thus the only work that
needs to be
done is to acquire the ProcArrayLock briefly. And because there is
only a
small number of cores, contention on the cacheline for that isn't a
problem.
Thanks for sharing this.
These results look much saner, but IMHO it also does not show
any clear
benefit of the patch. Or are you still claiming there is a benefit?
We agree that they are micro-optimizations. However, I think they
should be
considered micro-optimizations in inner loops, because everything in
procarray.c is a hot path.
As explained earlier, I don't agree that they optimize anything - you're
making some of the scalability behaviour *worse*, if it's changed at
all.
The first objective, I believe, was achieved, with no performance
regression.
I agree the gains are small, based on the tests done.
There are no gains.
IMHO, I must disagree.
You don't have to, really. What you should do is showing results
demonstrating the claimed gains, and so far you have not done that.
I don't want to be rude, but so far you've shown results from a
benchmark testing fork(), due to only running 10 transactions per
client, and then results from a single run for each client count (which
doesn't really show any gains either, and is quite noisy).
As mentioned GetSnapshotData() is not even in perf profile, so why would
the patch even make a difference?
You've also claimed it helps generating better code on older compilers,
but you've never supported that with any evidence.
Maybe there is an improvement - show us. Do a benchmark with more runs,
to average-out the noise. Calculate VAR/STDEV to show how variable the
results are. Use that to compare results and decide if there is an
improvement. Also, keep in mind binary layout matters [1].
[1]: https://www.youtube.com/watch?v=r-TLSBdHe1A
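The kind of analysis suggested above could be sketched like this: collect several tps samples per build and compare the mean difference against the run-to-run noise. The tps values below are made up for illustration; in practice they would come from parsing the "tps = ..." line of repeated pgbench runs.

```python
# Sketch: compare repeated pgbench runs for head vs. patched builds.
# The tps lists are placeholders -- real values would be collected by
# running pgbench several times per build with identical settings.
from statistics import mean, stdev

head_tps    = [66613.7, 66201.4, 67102.9, 66850.2, 66447.5]
patched_tps = [66997.8, 66510.3, 67240.1, 66788.6, 67055.4]

def summarize(label, runs):
    """Print and return mean and sample standard deviation of a series."""
    m, s = mean(runs), stdev(runs)
    print(f"{label}: mean={m:.1f} tps, stdev={s:.1f} ({100 * s / m:.2f}%)")
    return m, s

m_head, s_head = summarize("head   ", head_tps)
m_patch, s_patch = summarize("patched", patched_tps)

# A difference smaller than the run-to-run noise is not evidence of a gain.
diff = m_patch - m_head
noise = max(s_head, s_patch)
print(f"difference: {diff:+.1f} tps "
      f"({'within' if abs(diff) <= noise else 'beyond'} noise)")
```

If the mean difference stays within the standard deviation of either series, more runs (or longer durations) are needed before claiming an improvement.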
But, IMHO, this is a good way, small gains turn into big gains in
the end,
when applied to all code.
Consider GetSnapshotData():
1. Most of the time the snapshot is not null, so:
if (snapshot == NULL) will fail most of the time.
With the patch:
if (snapshot->xip != NULL)
{
if (GetSnapshotDataReuse(snapshot))
return snapshot;
}
Most of the time the test is true and GetSnapshotDataReuse is not
called
for new
snapshots.
count, subcount, and suboverflowed will not be initialized for all
snapshots.
But that's irrelevant. There are only a few "new" snapshots in the
life of a
connection. You're optimizing something irrelevant.
IMHO, when GetSnapshotData() is the bottleneck, everything is relevant.
Maybe. Show us the difference.
2. If the snapshot is taken during recovery:
The pgprocnos and ProcGlobal->subxidStates are not touched unnecessarily.
That code isn't reached when in recovery?
Currently it is reached *even* when not in recovery.
With the patch, it is reached *only* when in recovery.
3. Calling GetSnapshotDataReuse() without first acquiring
ProcArrayLock.
There's an agreement that this would be fine, for now.
There's no such agreement at all. It's not correct.
Ok, but there is a chance it will work correctly.
Either it's correct or not. Chance of being correct does not count.
Consider ComputeXidHorizons()
1. ProcGlobal->statusFlags is touched before the lock.
Hard to believe that'd have a measurable effect.
IMHO, anything you move out from under the lock is a benefit.
Maybe. Show us the difference.
2. allStatusFlags[index] is not touched for all numProcs.
I'd be surprised if the compiler couldn't defer that load on its own.
Better be sure of that, no?
We rely on compilers doing this in about a million other places.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 28, 2022 at 09:00, Tomas Vondra <
tomas.vondra@enterprisedb.com> wrote:
On 5/28/22 02:15, Ranier Vilela wrote:
On Fri, May 27, 2022 at 18:08, Andres Freund <andres@anarazel.de
<mailto:andres@anarazel.de>> wrote:
Hi,
On 2022-05-27 03:30:46 +0200, Tomas Vondra wrote:
On 5/27/22 02:11, Ranier Vilela wrote:
./pgbench -M prepared -c $conns -j $conns -T 60 -S -n -U postgres
pgbench (15beta1)
transaction type: <builtin: select only>
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 60 s
conns tps head tps patched
1 17126.326108 17792.414234
10 82068.123383 82468.334836
50 73808.731404 74678.839428
80 73290.191713 73116.553986
90 67558.483043 68384.906949
100 65960.982801 66997.793777
200 62216.011998 62870.243385
300 62924.225658 62796.157548
400 62278.099704 63129.555135
500 63257.930870 62188.825044
600 61479.890611 61517.913967
700 61139.354053 61327.898847
800 60833.663791 61517.913967
900 61305.129642 61248.336593
1000 60990.918719 61041.670996
These results look much saner, but IMHO it also does not show any
clear
benefit of the patch. Or are you still claiming there is a benefit?
They don't look all that sane to me - isn't that way lower than one
would
expect?
Yes, quite disappointing.
Restricting both client and server to the same four cores on a
thermally challenged older laptop I have around, I get 150k tps at
both 10
and 100 clients.
Can you share the benchmark details? Hardware, postgres, and pgbench,
please?
Either way, I'd not expect to see any GetSnapshotData() scalability
effects to
show up on an "Intel® Core™ i5-8250U CPU Quad Core" - there's just
not enough
concurrency.
This means that our customers will not see any connection scalability
with PG15, using the simplest hardware?
No. It means that on a 4-core machine GetSnapshotData() is unlikely to be
a bottleneck, because you'll hit various other bottlenecks way earlier.
I personally doubt it even makes sense to worry about scaling to this
many connections on such a tiny system too much.
The correct pieces of these changes seem very unlikely to affect
GetSnapshotData() performance meaningfully.
To improve something like GetSnapshotData() you first have to come
up with a
workload that shows it being a meaningful part of a profile. Unless
it is,
performance differences are going to just be due to various forms of
noise.
Actually, in the profiles I got with perf, GetSnapshotData() didn't show
up.
But that's exactly the point Andres is trying to make - if you don't see
GetSnapshotData() in the perf profile, why do you think optimizing it
will have any meaningful impact on throughput?
You see, I've seen in several places that GetSnapshotData() is the
bottleneck in scaling connections.
One of them, if I remember correctly, was at an IBM in Russia.
Other statements occur in [1], [2], and [3].
Just because I don't have enough hardware to make GetSnapshotData() a
bottleneck doesn't mean optimizing it won't make a difference.
And even on my modest hardware, we've seen gains, small but consistent.
So IMHO everyone will benefit, including the small servers.
regards,
Ranier Vilela
[1]: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462
[2]: /messages/by-id/5198715A.6070808@vmware.com
[3]: https://it-events.com/system/attachments/files/000/001/098/original/PostgreSQL_%D0%BC%D0%B0%D1%81%D1%88%D1%82%D0%B0%D0%B1%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5.pdf?1448975472
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 28 May 2022, at 16:12, Ranier Vilela <ranier.vf@gmail.com> wrote:
Just because I don't have enough hardware to make GetSnapshotData() a bottleneck doesn't mean optimizing it won't make a difference.
Quoting Andres from upthread:
"To improve something like GetSnapshotData() you first have to come up with
a workload that shows it being a meaningful part of a profile. Unless it
is, performance differences are going to just be due to various forms of
noise."
If you think this is a worthwhile improvement, you need to figure out a way to
reliably test it in order to prove it.
--
Daniel Gustafsson https://vmware.com/