track_planning causing performance regression

Started by Tharakan, Robins · almost 6 years ago · 64 messages · pgsql-hackers
#1Tharakan, Robins
tharar@amazon.com

Hi,

During a fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows a
~45% performance drop [2] at high DB connection counts (when compared with v12.3).

Disabling pg_stat_statements.track_planning (which is 'on' by default)
brings the TPS numbers back up to v12.3 levels.

The inflection point (in this test case) is 128 connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier, possibly because the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled by default, so as not to trip up users upgrading from v12 with pg_stat_statements
enabled (but otherwise not particularly interested in track_planning).

These are some details around the above test:

pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB
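For anyone trying to reproduce this, the parameters above roughly correspond to a loop like the following. This is a dry-run sketch, not the poster's actual harness: the database name `bench` is an assumption, commands are echoed rather than executed, and the thread count is capped at the client count since pgbench rejects more threads than clients.

```shell
#!/bin/sh
# Benchmark loop sketched from the parameters above. Commands are
# echoed (dry run); drop the "echo" to run against a provisioned server.
SCALE=100
THREADS=16
DURATION=30

# One-time initialization at scale factor 100.
echo pgbench -i -s "$SCALE" bench

# Fully-cached, SELECT-only runs at increasing connection counts.
for CONNS in 1 2 4 8 16 32 64 128 256 512 768 1024 1280 1536 1792 2048; do
    # pgbench needs threads <= clients, so cap -j at -c.
    J=$((CONNS < THREADS ? CONNS : THREADS))
    echo pgbench -n -S -M simple -j "$J" -T "$DURATION" \
         -c "$CONNS" bench
done
```

Repeating the loop with `-M prepared`, and with `pg_stat_statements.track_planning` toggled in `postgresql.conf`, covers the variants discussed later in the thread.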

Reference:
1) /messages/by-id/1554150919882-0.post@n3.nabble.com

2) Fully-cached-select-only TPS drops >= 128 connections.

Conn v12.3 v13Beta1 v13Beta1 (track_planning=off)
1 6,764 6,734 6,905
2 14,978 14,961 15,316
4 31,641 32,012 36,961
8 71,989 68,848 69,204
16 129,056 131,157 132,773
32 231,910 226,718 253,316
64 381,778 371,782 385,402
128 534,661 ====> 353,944 539,231
256 636,794 ====> 248,825 643,631
512 574,447 ====> 213,033 555,099
768 493,912 ====> 214,801 502,014
1024 484,993 ====> 222,492 490,716
1280 480,571 ====> 223,296 483,843
1536 475,030 ====> 228,137 477,153
1792 472,145 ====> 229,027 474,423
2048 471,385 ====> 228,665 470,238

3) perf - v13Beta1

-   88.38%     0.17%  postgres      postgres               [.] PostgresMain
   - 88.21% PostgresMain                    
      - 80.09% exec_simple_query            
         - 25.34% pg_plan_queries           
            - 25.28% pg_plan_query          
               - 25.21% pgss_planner        
                  - 14.36% pgss_store       
                     + 13.54% s_lock        
                  + 10.71% standard_planner 
         + 18.29% PortalRun                 
         - 15.12% PortalDrop                
            - 14.73% PortalCleanup          
               - 13.78% pgss_ExecutorEnd    
                  - 13.72% pgss_store       
                     + 12.83% s_lock        
                 0.72% standard_ExecutorEnd 
         + 6.18% PortalStart                
         + 4.86% pg_analyze_and_rewrite     
         + 3.52% GetTransactionSnapshot     
         + 2.56% pg_parse_query             
         + 1.83% finish_xact_command        
           0.51% start_xact_command         
      + 3.93% pq_getbyte                    
      + 3.40% ReadyForQuery                 

4) perf - v12.3

-   84.32%     0.21%  postgres      postgres               [.] PostgresMain
   - 84.11% PostgresMain                    
      - 72.56% exec_simple_query            
         + 26.71% PortalRun                 
         - 15.33% pg_plan_queries           
            - 15.29% pg_plan_query          
               + 15.21% standard_planner    
         + 7.81% PortalStart                
         + 6.76% pg_analyze_and_rewrite     
         + 4.37% GetTransactionSnapshot     
         + 3.69% pg_parse_query             
         - 2.96% PortalDrop                 
            - 2.42% PortalCleanup           
               - 1.35% pgss_ExecutorEnd     
                  - 1.22% pgss_store        
                       0.57% s_lock         
                 0.77% standard_ExecutorEnd 
         + 2.16% finish_xact_command        
         + 0.78% start_xact_command         
         + 0.59% pg_rewrite_query           
      + 5.67% pq_getbyte                    
      + 4.73% ReadyForQuery

-
robins

#2Julien Rouhaud
rjuju123@gmail.com
In reply to: Tharakan, Robins (#1)
Re: track_planning causing performance regression

Hi,

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
~45% performance drop [2] at high DB connection counts (when compared with v12.3)

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

These are some details around the above test:

pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB

I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though. It seems to me that if you
really have this workload consisting of these few queries and want
to get better performance, you'll use a connection pooler
and/or prepared statements anyway, which will make this overhead
disappear entirely and will also yield an even bigger performance
improvement. A quick test using pgbench -M prepared, with
track_planning enabled and still far too many connections, already
shows a 25% improvement over -M simple without track_planning.

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Julien Rouhaud (#2)
Re: track_planning causing performance regression

On 2020/06/29 16:05, Julien Rouhaud wrote:

Hi,

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!

~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

These are some details around the above test:

pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB

I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement. A quick test using pgbench -M prepared, with
track_planning enabled, with still way too many connections already
shows a 25% improvement over the -M simple without track_planning.

I understand your point. But IMO the default setting basically should
be safer value, i.e., off at least until the problem disappears.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#4Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#3)
Re: track_planning causing performance regression

On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/06/29 16:05, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!

~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Julien Rouhaud (#4)
Re: track_planning causing performance regression

On 2020/06/29 18:17, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/06/29 16:05, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!

~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?

It'll also quite increase the shared memory consumption.

Yes.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.

Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#6Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#5)
Re: track_planning causing performance regression

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?

A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.

Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...

Yes, that's clearly not something I'd vote to push in v13 at this point.

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Julien Rouhaud (#6)
Re: track_planning causing performance regression

On 2020/06/29 18:53, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?

A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

Agreed. +1 to change that default to off.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#7)
Re: track_planning causing performance regression

On 2020/06/29 18:56, Fujii Masao wrote:

On 2020/06/29 18:53, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?

A POC patch should be easy to do and see how much it solves this
problem.  However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

Agreed. +1 to change that default to off.

Attached patch does this.

I also add the following into the description about each *_plan_time column
in the docs. IMO this is helpful for users when they see that those columns
report zero by default and try to understand why.

(if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

disable_pgss_planning_default_v1.patch (text/plain; +18/-3)
#9Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#8)
Re: track_planning causing performance regression

On Mon, Jun 29, 2020 at 1:14 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/06/29 18:56, Fujii Masao wrote:

On 2020/06/29 18:53, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?

A POC patch should be easy to do and see how much it solves this
problem. However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

Agreed. +1 to change that default to off.

Attached patch does this.

Patch looks good to me.

I also add the following into the description about each *_plan_time column
in the docs. IMO this is helpful for users when they see that those columns
report zero by default and try to understand why.

(if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)

+1

Do you intend to wait for other input before pushing? FWIW I'm still
not convinced that the exposed problem is representative of any
realistic workload. I of course entirely agree with the other
documentation changes.

#10Ants Aasma
ants.aasma@cybertec.at
In reply to: Julien Rouhaud (#4)
Re: track_planning causing performance regression

On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

On 2020/06/29 16:05, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!

~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.

The problem looks to be that spinlocks are terrible with overloaded CPU and
a contended spinlock. A process holding the spinlock might easily get
scheduled out leading to excessive spinning by everybody. I think a simple
thing to try would be to replace the spinlock with LWLock.

I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered. We have done a great job at
eliminating spinlocks from contended code paths. Robins, perhaps you could
try it to see if it reduces the regression you are observing. The patch is
against v13 stable branch.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Attachments:

futex-prototype.patch (text/x-patch; +119/-0)
#11Peter Geoghegan
pg@bowt.ie
In reply to: Fujii Masao (#3)
Re: track_planning causing performance regression

On Mon, Jun 29, 2020 at 1:55 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement. A quick test using pgbench -M prepared, with
track_planning enabled, with still way too many connections already
shows a 25% improvement over the -M simple without track_planning.

I understand your point. But IMO the default setting basically should
be safer value, i.e., off at least until the problem disappears.

+1 -- this regression seems unacceptable to me.

--
Peter Geoghegan

#12Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#11)
Re: track_planning causing performance regression

On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

+1 -- this regression seems unacceptable to me.

I added an open item to track this.

Thanks
--
Peter Geoghegan

#13Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#2)
Re: track_planning causing performance regression

Hi,

On 2020-06-29 09:05:18 +0200, Julien Rouhaud wrote:

I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though. It seems to me that if you
really have this workload that consists in these few queries and want
to get better performance, you'll anyway use a connection pooler
and/or use prepared statements, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement.

It's extremely common to have times where there are more active
queries than CPUs. And a pooler won't avoid that fully, at least not
without drastically reducing overall throughput.

Greetings,

Andres Freund

#14Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#3)
Re: track_planning causing performance regression

Hi,

On 2020-06-29 17:55:28 +0900, Fujii Masao wrote:

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

I suspect that the best thing would be to just turn the spinlock into an
lwlock. Spinlocks deal terribly with contention. I suspect it'd solve
the performance issue entirely. And it might even be possible, further
down the line, to just use a shared lock, and use atomics for the
counters.

Greetings,

Andres Freund

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Ants Aasma (#10)
Re: track_planning causing performance regression

On 2020/06/29 22:23, Ants Aasma wrote:

On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju123@gmail.com <mailto:rjuju123@gmail.com>> wrote:

On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

On 2020/06/29 16:05, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tharar@amazon.com <mailto:tharar@amazon.com>> wrote:

During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!

~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.

The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.

Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.

I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.

I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?

We have done a great job at eliminating spinlocks from contended code paths. Robins, perhaps you could try it to see if it reduces the regression you are observing.

Yes. Also we need to check that this change doesn't increase performance
overhead in other workloads.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

pgss_lwlock_v1.patch (text/plain; +20/-12)
#16Ants Aasma
ants.aasma@cybertec.at
In reply to: Fujii Masao (#15)
Re: track_planning causing performance regression

On Tue, 30 Jun 2020 at 08:43, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

The problem looks to be that spinlocks are terrible with overloaded CPU and
a contended spinlock. A process holding the spinlock might easily get
scheduled out leading to excessive spinning by everybody. I think a simple
thing to try would be to replace the spinlock with LWLock.

Yes. Attached is the POC patch that replaces per-counter spinlock with
LWLock.

Great. I think this is the one that should get considered for testing.

I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered.

I'm not familiar with futex, but could you tell me why you used futex
instead of LWLock that we already have? Is futex portable?

Futex is a Linux kernel call that allows one to build a lock whose
uncontended cases work entirely in user space, almost exactly like a spinlock,
while falling back to syscalls that wait for wakeup in case of contention.
It's not portable, but probably something similar could be implemented for
other operating systems. I did not pursue this further because it became
apparent that every performance-critical spinlock had already been removed.

To be clear, I am not advocating for this patch to get included. I just had
the patch immediately available and it could have confirmed that using a
better lock fixes things.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Ants Aasma (#16)
Re: track_planning causing performance regression

On 2020/06/30 20:30, Ants Aasma wrote:

> On Tue, 30 Jun 2020 at 08:43, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>>> The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.
>>
>> Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.
>
> Great. I think this is the one that should get considered for testing.
>
>>> I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
>>
>> I'm not familiar with futex, but could you tell me why you used futex instead
>> of LWLock that we already have? Is futex portable?
>
> Futex is a Linux kernel call that allows one to build a lock whose uncontended cases work entirely in user space, almost exactly like a spinlock, while falling back to syscalls that wait for wakeup in case of contention. It's not portable, but probably something similar could be implemented for other operating systems. I did not pursue this further because it became apparent that every performance-critical spinlock had already been removed.
>
> To be clear, I am not advocating for this patch to get included. I just had the patch immediately available and it could have confirmed that using a better lock fixes things.

Understood. Thanks for the explanation!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#18Fujii Masao
masao.fujii@gmail.com
In reply to: Peter Geoghegan (#12)
Re: track_planning causing performance regression

On 2020/06/30 7:29, Peter Geoghegan wrote:

> On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
>> +1 -- this regression seems unacceptable to me.
>
> I added an open item to track this.

Thanks!
I'm thinking of changing the default value of track_planning to off for v13.

Ants and Andres suggested replacing the spinlock used in pgss_store() with
an LWLock. I agreed with them and posted a POC patch doing that. But I think
the patch is an item for v14. The patch may address the reported performance
issue, but it may cause performance issues in other workloads, so we would
need to measure how it affects performance across various workloads.
It seems too late to do that at this stage of v13. Thoughts?
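[Editorial note: for anyone hitting this regression on v13 in the meantime, the mitigation discussed upthread is to disable the GUC explicitly while keeping the extension loaded:]

```ini
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track_planning = off   # avoid the extra pgss_store() call per plan
```

Execution-time statistics are still collected; only the per-planning-call counters (and their lock acquisition in the planner hook) are skipped.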

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#19Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#15)
Re: track_planning causing performance regression

Hi,

On 2020-06-30 14:43:39 +0900, Fujii Masao wrote:

>> I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
>
> I'm not familiar with futex, but could you tell me why you used futex instead
> of LWLock that we already have? Is futex portable?

We can't rely on futexes, they're linux only. I also don't see much of a
reason to use spinlocks (rather than lwlocks) here in the first place.

> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index cef8bb5a49..aa506f6c11 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -39,7 +39,7 @@
>   * in an entry except the counters requires the same.  To look up an entry,
>   * one must hold the lock shared.  To read or update the counters within
>   * an entry, one must hold the lock shared or exclusive (so the entry doesn't
> - * disappear!) and also take the entry's mutex spinlock.
> + * disappear!) and also take the entry's partition lock.
>   * The shared state variable pgss->extent (the next free spot in the external
>   * query-text file) should be accessed only while holding either the
>   * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
> @@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
>
>  #define JUMBLE_SIZE 1024 /* query serialization buffer size */
>
> +#define	PGSS_NUM_LOCK_PARTITIONS()		(pgss_max)
> +#define	PGSS_HASH_PARTITION_LOCK(key)	\
> +	(&(pgss->base +	\
> +	   (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
> +
>  /*
>   * Extension version number, for supporting older extension versions' objects
>   */
> @@ -207,7 +212,7 @@ typedef struct pgssEntry
>  	Size		query_offset;	/* query text offset in external file */
>  	int			query_len;		/* # of valid bytes in query string, or -1 */
>  	int			encoding;		/* query text encoding */
> -	slock_t		mutex;			/* protects the counters only */
> +	LWLock	   	*lock;			/* protects the counters only */
>  } pgssEntry;

Why did you add the hashing here? It seems a lot better to just add an
lwlock in-place instead of the spinlock? The added size is negligible
compared to the size of pgssEntry.

Greetings,

Andres Freund

#20Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#16)
Re: track_planning causing performance regression

Hi,

On 2020-06-30 14:30:03 +0300, Ants Aasma wrote:

> Futex is a Linux kernel call that allows one to build a lock whose
> uncontended cases work entirely in user space, almost exactly like a spinlock,
> while falling back to syscalls that wait for wakeup in case of contention.
> It's not portable, but probably something similar could be implemented for
> other operating systems. I did not pursue this further because it became
> apparent that every performance-critical spinlock had already been removed.

Our lwlock implementation does have that property already, though. While
the kernel wait is implemented using semaphores, those are implemented
using futexes internally (posix ones, not sysv ones, so only after
whatever version we switched the default to posix semas on linux).

I'd rather move towards removing spinlocks from postgres than making
their implementation more complicated...

Greetings,

Andres Freund

#21Peter Geoghegan
pg@bowt.ie
In reply to: Fujii Masao (#18)
#22Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#19)
#23Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#22)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Peter Geoghegan (#21)
#26Peter Geoghegan
pg@bowt.ie
In reply to: Fujii Masao (#25)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Peter Geoghegan (#26)
#28Pavel Stehule
pavel.stehule@gmail.com
In reply to: Fujii Masao (#25)
#29Fujii Masao
masao.fujii@gmail.com
In reply to: Pavel Stehule (#28)
#30Pavel Stehule
pavel.stehule@gmail.com
In reply to: Fujii Masao (#29)
#31Fujii Masao
masao.fujii@gmail.com
In reply to: Pavel Stehule (#30)
#32Pavel Stehule
pavel.stehule@gmail.com
In reply to: Fujii Masao (#31)
#33Fujii Masao
masao.fujii@gmail.com
In reply to: Pavel Stehule (#32)
#34Hamid Akhtar
hamid.akhtar@gmail.com
In reply to: Fujii Masao (#33)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Hamid Akhtar (#34)
#36Hamid Akhtar
hamid.akhtar@gmail.com
In reply to: Fujii Masao (#35)
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Hamid Akhtar (#36)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#37)
#39Hamid Akhtar
hamid.akhtar@gmail.com
In reply to: Fujii Masao (#38)
#40Hamid Akhtar
hamid.akhtar@gmail.com
In reply to: Hamid Akhtar (#39)
#41bttanakahbk
bttanakahbk@oss.nttdata.com
In reply to: Fujii Masao (#38)
#42Fujii Masao
masao.fujii@gmail.com
In reply to: bttanakahbk (#41)
#43Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#42)
#44Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#42)
#45Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#44)
#46Tharakan, Robins
tharar@amazon.com
In reply to: Andres Freund (#45)
#47Tharakan, Robins
tharar@amazon.com
In reply to: Tharakan, Robins (#46)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tharakan, Robins (#46)
#49Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#45)
#50Justin Pryzby
pryzby@telsasoft.com
In reply to: Fujii Masao (#29)
#51Fujii Masao
masao.fujii@gmail.com
In reply to: Justin Pryzby (#50)
#52Justin Pryzby
pryzby@telsasoft.com
In reply to: Fujii Masao (#51)
#53Fujii Masao
masao.fujii@gmail.com
In reply to: Justin Pryzby (#52)
#54Justin Pryzby
pryzby@telsasoft.com
In reply to: Fujii Masao (#53)
#55Fujii Masao
masao.fujii@gmail.com
In reply to: Justin Pryzby (#54)
#56Justin Pryzby
pryzby@telsasoft.com
In reply to: Fujii Masao (#55)
#57Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#56)
#58Julien Rouhaud
rjuju123@gmail.com
In reply to: Justin Pryzby (#57)
#59Justin Pryzby
pryzby@telsasoft.com
In reply to: Julien Rouhaud (#58)
#60Julien Rouhaud
rjuju123@gmail.com
In reply to: Justin Pryzby (#59)
#61Fujii Masao
masao.fujii@gmail.com
In reply to: Julien Rouhaud (#60)
#62Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#61)
#63Fujii Masao
masao.fujii@gmail.com
In reply to: Julien Rouhaud (#62)
#64Julien Rouhaud
rjuju123@gmail.com
In reply to: Fujii Masao (#63)