Auto-vacuum timing out and preventing connections
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to
it and it has to be rebooted.
These seem to be the most closely related previous posts, but the CPU usage
isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
What can we do to diagnose this problem and get our database working
reliably again?
Thanks,
Dave
Hi,
On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
I don't think that autovacuum is the cause of the problem, but rather another
victim of it: the autovacuum launcher is still active and tries to schedule
workers, which can't connect either.
The log appears about every 5 minutes and eventually nothing can connect to
it and it has to be rebooted.
Are you saying that you have to reboot every 5 minutes?
Also, do you mean reboot the server or just restarting the postgres service is
enough?
These seem to be the most closely related previous posts, but the CPU usage
isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
What can we do to diagnose this problem and get our database working
reliably again?
As mentioned in the 2nd link, getting a strace of the postmaster when the
problem happens may help.
On Mon, Jun 27, 2022 at 10:42 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
I don't think that autovacuum is the cause of the problem, but rather another
victim of it: the autovacuum launcher is still active and tries to schedule
workers, which can't connect either.
Sorry, I should have provided some more details. These logs happen for
12-24 hours before the server stops accepting connections.
The log appears about every 5 minutes and eventually nothing can connect to
it and it has to be rebooted.
Are you saying that you have to reboot every 5 minutes?
No, that error is logged every 5 minutes, which matches our autovacuum nap time.
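For what it's worth, the nap time and the rest of the autovacuum settings can be confirmed from pg_settings with something like:

```sql
-- Show all autovacuum-related settings and where each value came from
-- (default, configuration file, etc.). autovacuum_naptime is the
-- launcher's wake-up interval; the stock default is 1min.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name LIKE 'autovacuum%';
```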
Also, do you mean reboot the server or just restarting the postgres service is
enough?
Restarting the postgres service.
These seem to be the most closely related previous posts, but the CPU usage
isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
What can we do to diagnose this problem and get our database working
reliably again?
As mentioned in the 2nd link, getting a strace of the postmaster when the
problem happens may help.
This is running in RDS on AWS, so I don't believe I can do an strace on the
service.
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect
to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is
there anything else in the log files?
What version are you using?
These seem to be the most closely related previous posts, but the CPU usage
isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
But, I don't see high CPU described as a symptom in either of those
threads.
If you can't reproduce the problem locally, there probably isn't much we
can do. Maybe ask Amazon to look into it, since they are the only ones
with sufficient access to do so.
Cheers,
Jeff
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect
to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is
there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database
becomes inaccessible.
What version are you using?
13.6
These seem to be the most closely related previous posts, but the CPU usage
isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
But, I don't see high CPU described as a symptom in either of those
threads.
I was referring to the "I've seen this happen under heavy load" statement.
Not sure that's the cause or related in those posts, but it doesn't appear
to be the issue here.
If you can't reproduce the problem locally, there probably isn't much we
can do. Maybe ask Amazon to look into it, since they are the only ones
with sufficient access to do so.
We've opened a support case, but I was trying to be proactive and seeing
what we could dig into on our end. Is there a way to tell which table the
auto-vacuum is trying to run on and timing out with?
On Wed, Jun 29, 2022 at 5:05 AM David Johansen <davejohansen@gmail.com> wrote:
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:
We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.
What version are you using?
13.6
These seem to be the most closely related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem:
/messages/by-id/20081105185206.GS4114@alvh.no-ip.org
/messages/by-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
But, I don't see high CPU described as a symptom in either of those threads.
I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in those posts, but it doesn't appear to be the issue here.
If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.
We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there a way to tell which table the auto-vacuum is trying to run on and timing out with?
Autovacuum workers are launched per database. The warning "worker took too
long to start; canceled" means that an autovacuum worker for a particular
database took too long in its startup phase (initializing, setting
parameters, etc.). There is no way to know which table or which database
was involved.
If it's reproducible, it may help to collect the contents of
pg_stat_activity while the issue is happening, to see whether another
process is waiting.
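A minimal snapshot of that view could look like the following (column names
as of PostgreSQL 13; trim or extend as needed):

```sql
-- Snapshot of all backends, oldest transactions first, to spot anything
-- stuck or waiting while new connections are being refused.
SELECT pid, backend_type, state,
       wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       left(query, 60) AS query_start
FROM pg_stat_activity
ORDER BY xact_start NULLS LAST;
```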
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-06-27 14:38:21 -0600, David Johansen wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to
it and it has to be rebooted.
How many databases, tables and functions do you have?
What can we do to diagnose this problem and get our database working
reliably again?
Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.
Greetings,
Andres Freund
On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-27 14:38:21 -0600, David Johansen wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to
it and it has to be rebooted.
How many databases, tables and functions do you have?
1 database
427 tables
1 function
What can we do to diagnose this problem and get our database working
reliably again?
Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.
Yes, I definitely understand that AWS will likely need to be involved or
resolve this issue, but as I said previously, I'm trying to be proactive
and just see what I could find.
You can get the autovacuum activity from the pg_stat views and try to find which table causes the timeout. You can even disable autovacuum to check whether it still happens.
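As a sketch of that approach: per-table autovacuum history is in pg_stat_user_tables, and any vacuum actually running shows up in pg_stat_progress_vacuum. (On RDS, disabling autovacuum would have to go through the parameter group rather than the configuration file.)

```sql
-- When each table was last autovacuumed, and how much dead-tuple debt it
-- carries; a table whose vacuums never complete stays at NULL.
SELECT relname, last_autovacuum, autovacuum_count, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST;

-- Any vacuum in flight right now, and which phase it is in.
SELECT pid, relid::regclass AS table_name, phase
FROM pg_stat_progress_vacuum;
```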
---Original---
From: "David Johansen"<davejohansen@gmail.com>
Date: Sat, Jul 2, 2022 12:34 PM
To: "Andres Freund"<andres@anarazel.de>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: Auto-vacuum timing out and preventing connections
On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.
How many databases, tables and functions do you have?
1 database
427 tables
1 function
> What can we do to diagnose this problem and get our database working
> reliably again?
Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.
Yes, I definitely understand that AWS will likely need to be involved or resolve this issue, but as I said previously, I'm trying to be proactive and just see what I could find.
On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
wrote:
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect
to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is
there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database
becomes inaccessible.
To follow up on this: the warning was the symptom and not the cause. The
auto-vacuum was failing to start because of a bug and was not itself the
cause of the problem.
Hi,
On 2022-07-14 10:51:39 -0600, David Johansen wrote:
On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
wrote:
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect
to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is
there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database
becomes inaccessible.
To follow up on this, this was the symptom and not the cause. The
auto-vacuum was failing to start because of a bug and not the cause of the
problem.
What bug?
Greetings,
Andres Freund
On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-14 10:51:39 -0600, David Johansen wrote:
On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
wrote:
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
wrote:
We're running into an issue where the database can't be connected to. It
appears that the auto-vacuum is timing out and then that prevents new
connections from happening. This assumption is based on these logs showing
up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect
to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is
there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database
becomes inaccessible.
To follow up on this, this was the symptom and not the cause. The
auto-vacuum was failing to start because of a bug and not the cause of the
problem.
What bug?
It appears to have been related to the scaling and process management that
Aurora Serverless V2 does. I haven't been able to find any info posted
about this issue from AWS, but we opened a support case and were told the
following:
We have identified a critical stability update for Aurora PostgreSQL
Serverless v2 instances running versions 13.6, 13.7, and 14.3. We have also
identified a critical issue in Aurora PostgreSQL Serverless v2 clusters
running versions 13.7 and 14.3. These issues can cause database restarts or
failovers under specific conditions. We have developed fixes and are
deploying the fixes in two patches. The patches will be automatically
applied to the affected instances and clusters in upcoming maintenance
windows over the next 3 weeks causing two restarts of your database. One
patch will show as a security and stability update and one patch will show
as a database update. They will be scheduled sequentially.
The symptoms we observed were slightly different than what is described
above, but we manually applied the patches as soon as they were available
and haven't noticed the problem since.