BUG #18141: sorry, too many clients error occurring very frequently

Started by PG Bug reporting formover 2 years ago8 messagesbugs

noreply@postgresql.org

over 2 years ago

The following bug has been logged on the website:

Bug reference: 18141
Logged by: Ravi Agrawal
Email address: ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system: RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations and crossing
the max_connections value. This has not been observed previously on Postgres
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific
version of PostgreSQL.

Appreciate your response on this.

Many Thanks,
Ravi Agrawal

Wetmore, Matthew (CTR)

Matthew.Wetmore@express-scripts.com

over 2 years ago

In reply to: PG Bug reporting form (#1)

I can second this behavior. We did a few things to mitigate the issue. HA may be at play here too.
We're still monitoring.

13.11 + HA - these issues
13.6 , no HA - no issues

-----Original Message-----
From: PG Bug reporting form <noreply@postgresql.org>
Sent: Friday, September 29, 2023 7:32 AM
To: pgsql-bugs@lists.postgresql.org
Cc: ravi.s.agrawal23@gmail.com
Subject: [EXTERNAL] BUG #18141: sorry, too many clients error occurring very frequently

The following bug has been logged on the website:

Bug reference: 18141
Logged by: Ravi Agrawal
Email address: ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system: RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry, too many clients' error very frequently.
Connection count is shooting up on performing basic operations and crossing the max_connections value. This has not been observed previously on Postgres
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific version of PostgreSQL.

Appreciate your response on this.

Many Thanks,
Ravi Agrawal

Joe Conway

mail@joeconway.com

over 2 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #18141: sorry, too many clients error occurring very frequently

On 9/29/23 10:31, PG Bug reporting form wrote:

We have updated PostgresSQL from version 13.4 to 13.11 and observing 'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations and crossing
the max_connections value. This has not been observed previously on Postgres
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific
version of PostgreSQL.

Question -- are you using pg_stat_statements? If so, please try
disabling it and see if the issue goes away. Also was is your work_mem
setting? Does raising it help?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Jeff Janes

jeff.janes@gmail.com

over 2 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #18141: sorry, too many clients error occurring very frequently

On Fri, Sep 29, 2023 at 7:55 PM PG Bug reporting form <
noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 18141
Logged by: Ravi Agrawal
Email address: ravi.s.agrawal23@gmail.com
PostgreSQL version: 13.11
Operating system: RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing
'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations and
crossing
the max_connections value. This has not been observed previously on
Postgres
v13.4
We need to understand if this issue is caused due to the change in version.
Is there any history of known incidents facing this issue with the specific
version of PostgreSQL.

I think it is unlikely to be a bug. How did you do the upgrade? Just
install the new binaries, then restart the server?

Can you capture plans from the newly slow queries and compare them to plans
from the same queries before the upgrade?

Cheers,

Jeff

Joe Conway

mail@joeconway.com

over 2 years ago

In reply to: Jeff Janes (#4)

Re: BUG #18141: sorry, too many clients error occurring very frequently

On 9/30/23 14:04, Jeff Janes wrote:

On Fri, Sep 29, 2023 at 7:55 PM PG Bug reporting form
<noreply@postgresql.org <mailto:noreply@postgresql.org>> wrote:

The following bug has been logged on the website:

Bug reference: 18141
Logged by: Ravi Agrawal
Email address: ravi.s.agrawal23@gmail.com
<mailto:ravi.s.agrawal23@gmail.com>
PostgreSQL version: 13.11
Operating system: RHEL
Description:

Hi Team,

We have updated PostgresSQL from version 13.4 to 13.11 and observing
'sorry,
too many clients' error very frequently.
Connection count is shooting up on performing basic operations and
crossing
the max_connections value. This has not been observed previously on
Postgres
v13.4
We need to understand if this issue is caused due to the change in
version.
Is there any history of known incidents facing this issue with the
specific
version of PostgreSQL.

I think it is unlikely to be a bug. How did you do the upgrade? Just
install the new binaries, then restart the server?

Maybe, maybe not. I have seen two other cases that are similar. One was
an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.

I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the
same day as 13.8.

In both cases queries against pg_stat_statements suddenly started taking
more memory leading to spillage to pg_temp and a step degradation in
overall performance. In at least one of those cases the
solution/workaround was to increase work_mem. In the other I think
pg_stat_statements was disabled.

Myself and at least one other hacker looked at the pg_stat_statements
specific changes in that time interval and saw no smoking gun.

But it is possible that something else backpatched to both branches
between Aug 09, 2021 and Aug 8, 2022 has caused a more general
performance regression which we have yet to track down.

At least based on this sample of two (now maybe 3?) folks with similar
symptoms.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Joe Conway (#5)

Re: BUG #18141: sorry, too many clients error occurring very frequently

Joe Conway <mail@joeconway.com> writes:

Maybe, maybe not. I have seen two other cases that are similar. One was
an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.

I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the
same day as 13.8.

In both cases queries against pg_stat_statements suddenly started taking
more memory leading to spillage to pg_temp and a step degradation in
overall performance. In at least one of those cases the
solution/workaround was to increase work_mem. In the other I think
pg_stat_statements was disabled.

Myself and at least one other hacker looked at the pg_stat_statements
specific changes in that time interval and saw no smoking gun.

But it is possible that something else backpatched to both branches
between Aug 09, 2021 and Aug 8, 2022 has caused a more general
performance regression which we have yet to track down.

Hmm. My first instinct is to wonder about changes in plan selection.
How complex were the troublesome queries?

regards, tom lane

Joe Conway

mail@joeconway.com

over 2 years ago

In reply to: Tom Lane (#6)

Re: BUG #18141: sorry, too many clients error occurring very frequently

On 10/1/23 14:09, Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Maybe, maybe not. I have seen two other cases that are similar. One was
an upgrade from 12.8 to 12.12 and the other an upgrade from 13.4 to 13.8.

I checked and 12.8 was stamped is the same date as 13.4, and 12.12 the
same day as 13.8.

In both cases queries against pg_stat_statements suddenly started taking
more memory leading to spillage to pg_temp and a step degradation in
overall performance. In at least one of those cases the
solution/workaround was to increase work_mem. In the other I think
pg_stat_statements was disabled.

Myself and at least one other hacker looked at the pg_stat_statements
specific changes in that time interval and saw no smoking gun.

But it is possible that something else backpatched to both branches
between Aug 09, 2021 and Aug 8, 2022 has caused a more general
performance regression which we have yet to track down.

Hmm. My first instinct is to wonder about changes in plan selection.
How complex were the troublesome queries?

I think both of these cases involved a number of common attributes:

* The queries against pg_stat_statements were
relatively complex
* The other queries on the system were relatively
long and complex (and thus the query string length
in pg_stat_statements)
* Prior to the upgrade the systems were overall
keeping up, but extremely busy

In one case it seems that the upgrade caused a significant increase of
temp file usage. This impacted the system enough that other active
queries took longer, and thus number of active connections increased.
Raising work_mem eliminated the temp file usage and cpu loads dropped
back to similar levels as they were prior to the minor upgrade.

The other case had different specifics, but generally involved increased
memory usage. In that one eliminating the use of pg_stat_statements
restored performance. They did not try raising work_mem (as I understand
it), and I did not get any info regarding temp file spillage there.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Joe Conway (#7)

Re: BUG #18141: sorry, too many clients error occurring very frequently

Joe Conway <mail@joeconway.com> writes:

On 10/1/23 14:09, Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

But it is possible that something else backpatched to both branches
between Aug 09, 2021 and Aug 8, 2022 has caused a more general
performance regression which we have yet to track down.

Hmm. My first instinct is to wonder about changes in plan selection.
How complex were the troublesome queries?

I think both of these cases involved a number of common attributes:
* The queries against pg_stat_statements were
relatively complex
* The other queries on the system were relatively
long and complex (and thus the query string length
in pg_stat_statements)
* Prior to the upgrade the systems were overall
keeping up, but extremely busy
In one case it seems that the upgrade caused a significant increase of
temp file usage. This impacted the system enough that other active
queries took longer, and thus number of active connections increased.
Raising work_mem eliminated the temp file usage and cpu loads dropped
back to similar levels as they were prior to the minor upgrade.

Interesting. After some desultory trawling through the commit log,
I'm wondering if there could be some connection to

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: REL_13_STABLE Release: REL_13_5 [dede14399] 2021-09-20 11:48:52 -0400
Branch: REL_12_STABLE Release: REL_12_9 [f230614da] 2021-09-20 11:48:52 -0400
Branch: REL_11_STABLE Release: REL_11_14 [914e54501] 2021-09-20 11:48:52 -0400
Branch: REL_10_STABLE Release: REL_10_19 [923b7efc2] 2021-09-20 11:48:52 -0400
Branch: REL9_6_STABLE Release: REL9_6_24 [183b3aced] 2021-09-20 11:48:52 -0400

Don't elide casting to typmod -1.

That's in the right timeframe for these reports. A fairly direct
connection to pg_stat_statements could be made: addition of
RelabelType nodes that weren't there before would change the query ID
hash of affected statements, so that after running for a little while
the pg_stat_statements hash would have a bunch of duplicate entries
(assuming you didn't bump up against pg_stat_statements.max).
As long as nothing happened to age out the now-dead entries with
the old query hashes, you'd have more rows in the pg_stat_statements
view than before, which could explain performance decreases in
queries on that view.

The big hole in this theory is that I would not have expected this
casting change to affect any large proportion of SQL commands,
so that it's a bit hard to credit it causing a lot of bloat in
pg_stat_statements. Still, some specific coding habit or DDL
detail could maybe allow that to happen in a particular application.

Another line of thought is that the extra RelabelType nodes could
block a planner optimization that used to occur before. It's not
apparent why that would manifest specifically in connection with
pg_stat_statements queries though. If that is the problem,
we'd need an example of a query whose plan changed in order to pin
down the cause.

Anyway, that's just a theory, and it might be hot air.

regards, tom lane