track_planning causing performance regression
Hi,
During a fully-cached, SELECT-only test using pgbench, Postgres v13Beta1 shows a
~45% performance drop [2] at high DB connection counts (compared with v12.3).
Disabling pg_stat_statements.track_planning (which is 'on' by default)
brings the TPS numbers back up to v12.3 levels.
The inflection point (in this test case) is 128 connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
probably didn't surface earlier because the regression is negligible at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled by default, so as not to trip up users upgrading from v12 with pg_stat_statements
enabled (but otherwise not particularly interested in track_planning).
These are some details around the above test:
pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB
Reference:
1) /messages/by-id/1554150919882-0.post@n3.nabble.com
2) Fully-cached-select-only TPS drops >= 128 connections.
Conn v12.3 v13Beta1 v13Beta1 (track_planning=off)
1 6,764 6,734 6,905
2 14,978 14,961 15,316
4 31,641 32,012 36,961
8 71,989 68,848 69,204
16 129,056 131,157 132,773
32 231,910 226,718 253,316
64 381,778 371,782 385,402
128 534,661 ====> 353,944 539,231
256 636,794 ====> 248,825 643,631
512 574,447 ====> 213,033 555,099
768 493,912 ====> 214,801 502,014
1024 484,993 ====> 222,492 490,716
1280 480,571 ====> 223,296 483,843
1536 475,030 ====> 228,137 477,153
1792 472,145 ====> 229,027 474,423
2048 471,385 ====> 228,665 470,238
3) perf - v13Beta1
- 88.38% 0.17% postgres postgres [.] PostgresMain
- 88.21% PostgresMain
- 80.09% exec_simple_query
- 25.34% pg_plan_queries
- 25.28% pg_plan_query
- 25.21% pgss_planner
- 14.36% pgss_store
+ 13.54% s_lock
+ 10.71% standard_planner
+ 18.29% PortalRun
- 15.12% PortalDrop
- 14.73% PortalCleanup
- 13.78% pgss_ExecutorEnd
- 13.72% pgss_store
+ 12.83% s_lock
0.72% standard_ExecutorEnd
+ 6.18% PortalStart
+ 4.86% pg_analyze_and_rewrite
+ 3.52% GetTransactionSnapshot
+ 2.56% pg_parse_query
+ 1.83% finish_xact_command
0.51% start_xact_command
+ 3.93% pq_getbyte
+ 3.40% ReadyForQuery
4) perf - v12.3
- 84.32% 0.21% postgres postgres [.] PostgresMain
- 84.11% PostgresMain
- 72.56% exec_simple_query
+ 26.71% PortalRun
- 15.33% pg_plan_queries
- 15.29% pg_plan_query
+ 15.21% standard_planner
+ 7.81% PortalStart
+ 6.76% pg_analyze_and_rewrite
+ 4.37% GetTransactionSnapshot
+ 3.69% pg_parse_query
- 2.96% PortalDrop
- 2.42% PortalCleanup
- 1.35% pgss_ExecutorEnd
- 1.22% pgss_store
0.57% s_lock
0.77% standard_ExecutorEnd
+ 2.16% finish_xact_command
+ 0.78% start_xact_command
+ 0.59% pg_rewrite_query
+ 5.67% pq_getbyte
+ 4.73% ReadyForQuery
--
robins
Hi,
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
These are some details around the above test:
pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB
I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.
I disagree with the conclusion though. It seems to me that if you
really have a workload that consists of these few queries and want
better performance, you'll use a connection pooler and/or prepared
statements anyway, which will make this overhead disappear entirely,
and will also yield an even bigger performance improvement. A quick
test using pgbench -M prepared, with track_planning enabled and still
far too many connections, already shows a 25% improvement over
-M simple without track_planning.
On 2020/06/29 16:05, Julien Rouhaud wrote:
Hi,
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
Thanks for the benchmark!
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
That's bad :(
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, every pgss counter like shared_blks_hit also
needs to be separated into two, i.e., one for planning and one for execution.
These are some details around the above test:
pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB
I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.
I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement. A quick test using pgbench -M prepared, with
track_planning enabled, with still way too many connections already
shows a 25% improvement over the -M simple without track_planning.
I understand your point. But IMO the default setting should basically
be the safer value, i.e., off, at least until the problem disappears.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
On 2020/06/29 16:05, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
Thanks for the benchmark!
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
That's bad :(
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also increase the shared memory consumption quite a bit.
I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
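The atomics idea might look roughly like the following. A minimal sketch using C11 atomics as a stand-in (PostgreSQL itself would use its pg_atomic_* API); the type and function names are invented for illustration. The comments mark where the per-row consistency guarantee is lost.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct AtomicCounters
{
    atomic_uint_fast64_t calls;
    atomic_uint_fast64_t rows;
    /* Timings would need a fixed-point representation (e.g. microseconds),
     * since C11 offers no atomic arithmetic on double. */
    atomic_uint_fast64_t total_time_us;
} AtomicCounters;

static void
count_query(AtomicCounters *c, uint64_t rows, uint64_t elapsed_us)
{
    /* Each field is updated independently: no spinlock at all, but also
     * no guarantee that a concurrent reader sees a mutually consistent
     * set of fields -- the downside raised above. */
    atomic_fetch_add_explicit(&c->calls, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&c->rows, rows, memory_order_relaxed);
    atomic_fetch_add_explicit(&c->total_time_us, elapsed_us,
                              memory_order_relaxed);
}
```

The totals stay exact under concurrency; only the snapshot taken by a reader of pg_stat_statements could mix old and new field values.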
On 2020/06/29 18:17, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
On 2020/06/29 16:05, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
Thanks for the benchmark!
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
That's bad :(
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
Yes. But maybe we can expect that the idea would improve
the performance to nearly the same level as v12?
It'll also quite increase the shared memory consumption.
Yes.
I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?
A POC patch should be easy to do, and would show how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.
I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...
Yes, that's clearly not something I'd vote to push in v13 at this point.
On 2020/06/29 18:53, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?
A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.
Agreed. +1 to change that default to off.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/06/29 18:56, Fujii Masao wrote:
On 2020/06/29 18:53, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?
A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.
Agreed. +1 to change that default to off.
Attached patch does this.
I also added the following to the description of each *_plan_time column
in the docs. IMO this is helpful for users who see those columns
report zero by default and try to understand why.
(if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
disable_pgss_planning_default_v1.patch (text/plain, +18 -3)
On Mon, Jun 29, 2020 at 1:14 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2020/06/29 18:56, Fujii Masao wrote:
On 2020/06/29 18:53, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?
A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.
Agreed. +1 to change that default to off.
Attached patch does this.
Patch looks good to me.
I also add the following into the description about each *_plan_time column
in the docs. IMO this is helpful for users when they see that those columns
report zero by default and try to understand why.
(if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)
+1
Do you intend to wait for other input before pushing? FWIW I'm still
not convinced that the exposed problem is representative of any
realistic workload. I of course entirely agree with the other
documentation changes.
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju123@gmail.com> wrote:
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
On 2020/06/29 16:05, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
Thanks for the benchmark!
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
That's bad :(
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.
I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
The problem looks to be that spinlocks behave terribly with an overloaded CPU
and a contended spinlock. A process holding the spinlock might easily get
scheduled out, leading to excessive spinning by everybody else. I think a
simple thing to try would be to replace the spinlock with an LWLock.
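The failure mode described here can be seen in a toy test-and-set spinlock. This is an illustrative sketch with invented names, not PostgreSQL's actual s_lock implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { atomic_flag locked; } toy_spinlock;

static void
toy_spin_lock(toy_spinlock *l)
{
    /* If the current lock holder is preempted by the scheduler, every
     * waiter burns its full timeslice in this loop before the holder
     * runs again and can release the lock.  A blocking primitive
     * (LWLock, mutex, futex) would put waiters to sleep instead. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* spin, consuming CPU */
}

static void
toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The proposed fix amounts to replacing the busy-wait in toy_spin_lock with a primitive that blocks in the kernel once contention is detected.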
I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered. We have done a great job at
eliminating spinlocks from contended code paths. Robins, perhaps you could
try it to see if it reduces the regression you are observing. The patch is
against the v13 stable branch.
--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Attachments:
futex-prototype.patch (text/x-patch, +119 -0)
On Mon, Jun 29, 2020 at 1:55 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement. A quick test using pgbench -M prepared, with
track_planning enabled, with still way too many connections already
shows a 25% improvement over the -M simple without track_planning.
I understand your point. But IMO the default setting basically should
be safer value, i.e., off at least until the problem disappears.
+1 -- this regression seems unacceptable to me.
--
Peter Geoghegan
On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
+1 -- this regression seems unacceptable to me.
I added an open item to track this.
Thanks
--
Peter Geoghegan
Hi,
On 2020-06-29 09:05:18 +0200, Julien Rouhaud wrote:
I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.
I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement.
It's extremely common to have times when there are more active
queries than CPUs. And a pooler won't avoid that fully, at least not
without drastically reducing overall throughput.
Greetings,
Andres Freund
Hi,
On 2020-06-29 17:55:28 +0900, Fujii Masao wrote:
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
I suspect that the best thing would be to just turn the spinlock into an
lwlock. Spinlocks deal terribly with contention. I suspect it'd solve
the performance issue entirely. And it might even be possible, further
down the line, to just use a shared lock, and use atomics for the
counters.
Greetings,
Andres Freund
On 2020/06/29 22:23, Ants Aasma wrote:
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju123@gmail.com> wrote:
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
On 2020/06/29 16:05, Julien Rouhaud wrote:
On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:
During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
Thanks for the benchmark!
~45% performance drop [2] at high DB connection counts (when compared with v12.3)
That's bad :(
Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.
The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.
It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).
Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.
One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.
I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
The problem looks to be that spinlocks are terrible with overloaded CPU and
a contended spinlock. A process holding the spinlock might easily get
scheduled out leading to excessive spinning by everybody. I think a simple
thing to try would be to replace the spinlock with LWLock.
Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.
I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?
We have done a great job at eliminating spinlocks from contended code paths. Robins, perhaps you could try it to see if it reduces the regression you are observing.
Yes. Also we need to check that this change doesn't increase performance
overhead in other workloads.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
pgss_lwlock_v1.patch (text/plain, +20 -12)
On Tue, 30 Jun 2020 at 08:43, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:
The problem looks to be that spinlocks are terrible with overloaded
CPU and a contended spinlock. A process holding the spinlock might easily
get scheduled out leading to excessive spinning by everybody. I think a
simple thing to try would be to replace the spinlock with LWLock.
Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.
Great. I think this is the one that should get considered for testing.
I did a prototype patch that replaces spinlocks with futexes, but was
not able to find a workload where it mattered.
I'm not familiar with futex, but could you tell me why you used futex
instead of LWLock that we already have? Is futex portable?
Futex is a Linux system call that allows building a lock whose
uncontended path works entirely in user space, almost exactly like a
spinlock, while falling back to syscalls that wait for a wakeup in case
of contention. It's not portable, but probably something similar could be
implemented for other operating systems. I did not pursue this further
because it became apparent that every performance-critical spinlock had
already been removed.
To be clear, I am not advocating for this patch to get included. I just had
the patch immediately available and it could have confirmed that using a
better lock fixes things.
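For readers unfamiliar with the mechanism, a minimal futex-based mutex looks roughly like the following. This is a Linux-only sketch in the style of Ulrich Drepper's "Futexes Are Tricky" (mutex3), not the prototype patch posted in this thread. State values: 0 = unlocked, 1 = locked, 2 = locked with waiters.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } futex_mutex;

static long
futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Compare-and-swap that returns the old value, like gcc's cmpxchg. */
static int
cmpxchg(atomic_int *v, int expected, int desired)
{
    atomic_compare_exchange_strong(v, &expected, desired);
    return expected;
}

static void
futex_mutex_lock(futex_mutex *m)
{
    int c = cmpxchg(&m->state, 0, 1);   /* fast path: no syscall */

    if (c != 0)
    {
        do
        {
            /* Mark the lock contended, then sleep until woken. */
            if (c == 2 || cmpxchg(&m->state, 1, 2) != 0)
                futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        } while ((c = cmpxchg(&m->state, 0, 2)) != 0);
    }
}

static void
futex_mutex_unlock(futex_mutex *m)
{
    /* Enter the kernel to wake a waiter only if contention was seen. */
    if (atomic_fetch_sub(&m->state, 1) != 1)
    {
        atomic_store(&m->state, 0);
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

The key property, as described above, is that acquire and release each cost one atomic operation when uncontended; waiters sleep in the kernel instead of spinning.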
--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
On 2020/06/30 20:30, Ants Aasma wrote:
On Tue, 30 Jun 2020 at 08:43, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.
Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.
Great. I think this is the one that should get considered for testing.
I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?
Futex is a Linux kernel call that allows to build a lock that has uncontended cases work fully in user space almost exactly like a spinlock, while falling back to syscalls that wait for wakeup in case of contention. It's not portable, but probably something similar could be implemented for other operating systems. I did not pursue this further because it became apparent that every performance critical spinlock had already been removed.
To be clear, I am not advocating for this patch to get included. I just had the patch immediately available and it could have confirmed that using a better lock fixes things.
Understood. Thanks for the explanation!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2020/06/30 7:29, Peter Geoghegan wrote:
On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
+1 -- this regression seems unacceptable to me.
I added an open item to track this.
Thanks!
I'm thinking of changing the default value of track_planning to off for v13.
Ants and Andres suggested replacing the spinlock used in pgss_store() with
an LWLock. I agreed with them and posted a POC patch doing that. But I think
the patch is an item for v14. The patch may address the reported performance
issue, but may cause other performance issues in other workloads. We would
need to measure how the patch affects performance in various workloads.
It seems too late to do that at this stage of v13. Thoughts?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi,
On 2020-06-30 14:43:39 +0900, Fujii Masao wrote:
I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?
We can't rely on futexes, they're linux only. I also don't see much of a
reason to use spinlocks (rather than lwlocks) here in the first place.
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cef8bb5a49..aa506f6c11 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -39,7 +39,7 @@
  * in an entry except the counters requires the same.  To look up an entry,
  * one must hold the lock shared.  To read or update the counters within
  * an entry, one must hold the lock shared or exclusive (so the entry doesn't
- * disappear!) and also take the entry's mutex spinlock.
+ * disappear!) and also take the entry's partition lock.
  * The shared state variable pgss->extent (the next free spot in the external
  * query-text file) should be accessed only while holding either the
  * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
@@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;

 #define JUMBLE_SIZE			1024	/* query serialization buffer size */

+#define PGSS_NUM_LOCK_PARTITIONS()	(pgss_max)
+#define PGSS_HASH_PARTITION_LOCK(key)	\
+	(&(pgss->base + \
+	   (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -207,7 +212,7 @@ typedef struct pgssEntry
 	Size		query_offset;	/* query text offset in external file */
 	int			query_len;		/* # of valid bytes in query string, or -1 */
 	int			encoding;		/* query text encoding */
-	slock_t		mutex;			/* protects the counters only */
+	LWLock	   *lock;			/* protects the counters only */
 } pgssEntry;
Why did you add the hashing here? It seems a lot better to just add an
lwlock in-place instead of the spinlock? The added size is negligible
compared to the size of pgssEntry.
Greetings,
Andres Freund
Hi,
On 2020-06-30 14:30:03 +0300, Ants Aasma wrote:
Futex is a Linux kernel call that allows to build a lock that has
uncontended cases work fully in user space almost exactly like a spinlock,
while falling back to syscalls that wait for wakeup in case of contention.
It's not portable, but probably something similar could be implemented for
other operating systems. I did not pursue this further because it became
apparent that every performance critical spinlock had already been removed.
Our lwlock implementation does have that property already, though. While
the kernel wait is implemented using semaphores, those are implemented
using futexes internally (POSIX ones, not SysV ones, so only after
whatever version we switched the default to POSIX semas on Linux).
I'd rather move towards removing spinlocks from postgres than making
their implementation more complicated...
Greetings,
Andres Freund