Slow shutdowns sometimes on RDS Postgres

Started by Chris Williamsover 7 years ago12 messagesgeneral
Jump to latest
#1Chris Williams
cswilliams@gmail.com

Hi,

I'm using AWS RDS Postgres (9.6.6) and have run into very slow shutdowns
(10+ minutes) a few times when making database modifications (e.g. reboot,
changing instance size, etc.). Other times, it shuts down quickly (1
minute or so). I have not been able to figure out why sometimes it takes a
long time to shutdown.

When it happens, I see a bunch of lines in the postgres log like the
following over and over (almost every second or two) during this 10 minute
shutdown period:
2018-09-12 06:37:01 UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db
:[16495]:FATAL:
2018-09-12 06:37:01 UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL:
the database system is shutting down

Once I start seeing these messages, I start manually shutting down all of
our applications that are connected to the db. I'm not sure if shutting
down the apps fixes it or if there's some timeout on the RDS side, but it
seems like once I start doing this, the database finally shuts down.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like. I've tried asking AWS's support why it takes
so long to shutdown sometimes, but they basically just told me that's "how
it works" and that I should try to shut down all of my connections ahead of
time before making database modifications.

We just have a few ruby on rails applications connected to the database,
and don't really have any long running or heavy queries and the db is under
very light load, so I don't understand why it takes so long to shutdown.
We do have a sizeable number of connections though (about 600) and there
are two replicas connected to it. I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if that
would help, but it made no difference.

I was wondering if anyone else had seen this behavior on their RDS Postgres
instances or had any suggestions on how I could shorten the shutdown time?

Thanks,
Chris

#2Joshua D. Drake
jd@commandprompt.com
In reply to: Chris Williams (#1)
Re: Slow shutdowns sometimes on RDS Postgres

On 09/13/2018 03:04 PM, Chris Williams wrote:

Hi,

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database modifications
(e.g. reboot, changing instance size, etc.).  Other times, it shuts
down quickly (1 minute or so).  I have not been able to figure out why
sometimes it takes a long time to shutdown.

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

JD

When it happens, I see a bunch of lines in the postgres log like the
following over and over (almost every second or two) during this 10
minute shutdown period:
2018-09-12 06:37:01
UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db:[16495]:FATAL:
2018-09-12 06:37:01
UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL: the database
system is shutting down

Once I start seeing these messages, I start manually shutting down all
of our applications that are connected to the db.  I'm not sure if
shutting down the apps fixes it or if there's some timeout on the RDS
side, but it seems like once I start doing this, the database finally
shuts down.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like.  I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries and
the db is under very light load, so I don't understand why it takes so
long to shutdown.  We do have a sizeable number of connections though
(about 600) and there are two replicas connected to it.  I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

Thanks,
Chris

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****

#3Chris Williams
cswilliams@gmail.com
In reply to: Joshua D. Drake (#2)
Re: Slow shutdowns sometimes on RDS Postgres

Yeah, I figured that. Unfortunately, every time it happens, I open a
support ticket with them, but they always just tell me that this is normal
behavior for postgres. Whether it's "normal" or not, I really would like
to get my db to shut down faster, and their suggestion of manually shutting
down all of my apps ahead of time is a real headache. Given that I haven't
gotten anywhere with their support, I figured I'd try asking on the mailing
list.

Thanks,
Chris

On Thu, Sep 13, 2018 at 3:17 PM Joshua D. Drake <jd@commandprompt.com>
wrote:

Show quoted text

On 09/13/2018 03:04 PM, Chris Williams wrote:

Hi,

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database modifications
(e.g. reboot, changing instance size, etc.). Other times, it shuts
down quickly (1 minute or so). I have not been able to figure out why
sometimes it takes a long time to shutdown.

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

JD

When it happens, I see a bunch of lines in the postgres log like the
following over and over (almost every second or two) during this 10
minute shutdown period:
2018-09-12 06:37:01
UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db:[16495]:FATAL:
2018-09-12 06:37:01
UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL: the database
system is shutting down

Once I start seeing these messages, I start manually shutting down all
of our applications that are connected to the db. I'm not sure if
shutting down the apps fixes it or if there's some timeout on the RDS
side, but it seems like once I start doing this, the database finally
shuts down.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like. I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries and
the db is under very light load, so I don't understand why it takes so
long to shutdown. We do have a sizeable number of connections though
(about 600) and there are two replicas connected to it. I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

Thanks,
Chris

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Chris Williams (#3)
Re: Slow shutdowns sometimes on RDS Postgres

On 9/13/18 3:59 PM, Chris Williams wrote:

Yeah, I figured that.  Unfortunately, every time it happens, I open a
support ticket with them, but they always just tell me that this is
normal behavior for postgres.  Whether it's "normal" or not, I really
would like to get my db to shut down faster, and their suggestion of
manually shutting down all of my apps ahead of time is a real headache.
Given that I haven't gotten anywhere with their support, I figured I'd
try asking on the mailing list.

The thing is, what you are doing ("(e.g. reboot, changing instance size,
etc.)") are instance operations not database operations. That comes
under AWS's purview. For what it is worth the behavior is documented:

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RebootInstance.html

"The time required to reboot your DB instance depends on the crash
recovery process of your specific database engine. To improve the reboot
time, we recommend that you reduce database activity as much as possible
during the reboot process. Reducing database activity reduces rollback
activity for in-transit transactions.
"

Pretty sure shutting down those 600 connections and uncoupling the
replication(s) goes a long way to the time elapsed.

Thanks,
Chris

On Thu, Sep 13, 2018 at 3:17 PM Joshua D. Drake <jd@commandprompt.com
<mailto:jd@commandprompt.com>> wrote:

On 09/13/2018 03:04 PM, Chris Williams wrote:

Hi,

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database

modifications

(e.g. reboot, changing instance size, etc.).  Other times, it shuts
down quickly (1 minute or so).  I have not been able to figure

out why

sometimes it takes a long time to shutdown.

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

JD

When it happens, I see a bunch of lines in the postgres log like the
following over and over (almost every second or two) during this 10
minute shutdown period:
2018-09-12 06:37:01
UTC:XXX.XXX.XXX.XXX(19712):my_user@my_db:[16495]:FATAL:
2018-09-12 06:37:01
UTC:localhost(31368):rdsadmin@rdsadmin:[16488]:FATAL: the database
system is shutting down

Once I start seeing these messages, I start manually shutting

down all

of our applications that are connected to the db.  I'm not sure if
shutting down the apps fixes it or if there's some timeout on the

RDS

side, but it seems like once I start doing this, the database

finally

shuts down.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like.  I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries

and

the db is under very light load, so I don't understand why it

takes so

long to shutdown.  We do have a sizeable number of connections

though

(about 600) and there are two replicas connected to it.  I also

tried

setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

Thanks,
Chris

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Juan Manuel Cuello
juanmacuello@gmail.com
In reply to: Chris Williams (#1)
Re: Slow shutdowns sometimes on RDS Postgres

On Thu, Sep 13, 2018 at 7:05 PM Chris Williams <cswilliams@gmail.com> wrote:

We just have a few ruby on rails applications connected to the database,
and don't really have any long running or heavy queries and the db is under
very light load, so I don't understand why it takes so long to shutdown.
We do have a sizeable number of connections though (about 600) and there
are two replicas connected to it. I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if that
would help, but it made no difference.

If you only have a few rails apps connected to the DB and a very light
load, may be you can try reducing the number of connections (how big are
your connection pools?) and see if that helps. 600 seems to be a big value
for just a few apps and a light load.

#6Jeremy Schneider
schnjere@amazon.com
In reply to: Joshua D. Drake (#2)
Re: Slow shutdowns sometimes on RDS Postgres

Hi Chris - this is an interesting one that we do see from time to time;
seems worth responding here as actually our best understanding right now
is that this is something in community code, not AWS-specific.

On 9/13/18 15:17, Joshua D. Drake wrote:

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

Hey JD! FWIW, the code differences are pretty minimal and generally just
what's required to have a managed service where people can still use the
database as they normally would. The biggest difference is just getting
used to operating without direct OS access, and working through
automation/tooling instead. (And there's always EC2 for the many
customers who want/need superuser but still don't want to maintain the
hardware.)

On 9/13/18 16:10, Adrian Klaver wrote:

The thing is, what you are doing ("(e.g. reboot, changing instance
size, etc.)") are instance operations not database operations. That
comes under AWS's purview.

Correct, managing reboots and hardware reconfigurations would be the
responsibility of AWS. However Chris' issue here is just that PostgreSQL
itself took a long time to shut down. I'm not aware of anything
RDS-specific with this.

On 09/13/2018 03:04 PM, Chris Williams wrote:

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database modifications
(e.g. reboot, changing instance size, etc.).  Other times, it shuts
down quickly (1 minute or so).  I have not been able to figure out why
sometimes it takes a long time to shutdown.

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

Once I start seeing these messages, I start manually shutting down all
of our applications that are connected to the db.  I'm not sure if
shutting down the apps fixes it or if there's some timeout on the RDS
side, but it seems like once I start doing this, the database finally
shuts down.

FYI, yes there is a timeout on the RDS side. The basic workflow is to
try to shutdown postgres the normal way, and if it hasn't cleanly shut
down after a period of time then forcefully kill it.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like.  I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

Yeah, that's a really frustrating answer! I can think of a few
additional things you can do. Obviously the holy grail here is a repro
(as usual) -- I bet if you have a reliable repro then RDS support will
be interested in taking a look at that. In fact if we can repro on RDS
then we should be able to repro on community PostgreSQL just as easily,
and could stack trace the non-responsive backend to see why it's stuck.

Outside of a repro, you can enable enhanced monitoring on RDS which
gives you a "top-like" process listing where you can at least see the
basic state of the stuck backend, much like you would with top/ps on
community postgres. Once you upgrade to PostgreSQL 10 on RDS you can
also enable Performance Insights which will show you the SQL, if the
backend happens to be active. I haven't gone real deep on pgSentinel
yet, but it provides similar functionality on community PostgreSQL.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries and
the db is under very light load, so I don't understand why it takes so
long to shutdown.  We do have a sizeable number of connections though
(about 600) and there are two replicas connected to it.  I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

Good data point - the fact that you set the idle timeout so low
reinforces my suspicion that there was a non-responsive backend involved
here.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

So yeah, it's not common... but you're not alone. If you're able to
narrow it down then please share anything you learn!

Hope this helps,
Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jeremy Schneider (#6)
Re: Slow shutdowns sometimes on RDS Postgres

On 9/14/18 8:43 AM, Jeremy Schneider wrote:

Hi Chris - this is an interesting one that we do see from time to time;
seems worth responding here as actually our best understanding right now
is that this is something in community code, not AWS-specific.

On 9/13/18 16:10, Adrian Klaver wrote:

The thing is, what you are doing ("(e.g. reboot, changing instance
size, etc.)") are instance operations not database operations. That
comes under AWS's purview.

Correct, managing reboots and hardware reconfigurations would be the
responsibility of AWS. However Chris' issue here is just that PostgreSQL
itself took a long time to shut down. I'm not aware of anything
RDS-specific with this.

The thing is I do not remember any posts to this list mentioning the
same problem on a platform outside RDS. A quick search seems to confirm
that.

Would it be possible to see the RDS shutdown script?

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

The particulars of that state?

FYI, yes there is a timeout on the RDS side. The basic workflow is to
try to shutdown postgres the normal way, and if it hasn't cleanly shut
down after a period of time then forcefully kill it.

Hope this helps,
Jeremy

--
Adrian Klaver
adrian.klaver@aklaver.com

#8Christophe Pettus
xof@thebuild.com
In reply to: Jeremy Schneider (#6)
Re: Slow shutdowns sometimes on RDS Postgres

On Sep 14, 2018, at 08:43, Jeremy Schneider <schnjere@amazon.com> wrote:
So yeah, it's not common...

In our experience, it's actually quite common that an RDS shutdown (or even just applying parameter changes) can take a while. What's particularly concerning is that it's not predictable, and that can make it hard to schedule and manage maintenance windows. What we were told previously is that RDS queues the operations, and it can take a variable amount of time for the operation to be worked on from the queue. Is that not the case?

--
-- Christophe Pettus
xof@thebuild.com

#9Jeremy Schneider
schnjere@amazon.com
In reply to: Christophe Pettus (#8)
Re: Slow shutdowns sometimes on RDS Postgres

On 9/14/18 10:04, Christophe Pettus wrote:

In our experience, it's actually quite common that an RDS shutdown (or
even just applying parameter changes) can take a while. What's
particularly concerning is that it's not predictable, and that can
make it hard to schedule and manage maintenance windows. What we were
told previously is that RDS queues the operations, and it can take a
variable amount of time for the operation to be worked on from the
queue. Is that not the case?

Thanks Christophe - even if it's not what Chris is running into, this is
is another good call-out.

It's important to distinguish here between the RDS parts and the
community PostgreSQL parts.  I think for this thread it's just worth
pointing out that RDS automation/tooling will report the database in a
"modifying" state until it completes its management operations, however
the actual database unavailability is much shorter.  RDS carefully
engineers their processes to minimize the actual database unavailability
itself.

Chris has run into a problem where the PostgreSQL processes did not shut
down, evidenced by the error messages he mentioned, and as a result his
database was actually unavailable to applications for an extended
period.  This is uncommon and concerning.

This isn't the right forum for discussing the RDS bits; lets take that
to the AWS forums.  It's not synchronous, but the time to complete
should absolutely be predictable within reasonable bounds depending on
the operation type. I don't know how anyone could use the platform
otherwise!  If anyone is unable to establish bounded expectations for
some automated operation, I'd strongly encourage starting a thread on
the AWS forums or opening a support ticket.

On 9/14/18 09:27, Adrian Klaver wrote:

The thing is I do not remember any posts to this list mentioning the
same problem on a platform outside RDS. A quick search seems to
confirm that.

I've met folks from other large fleet operators at PG conferences. 
There are all kinds of stories we don't find on the lists yet.  :) 
Hopefully we're all getting better about closing the loop and sharing
stuff back - that's part of the value large fleet operators can and
should bring to the community.

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

The particulars of that state?

For the cases I've heard about, we haven't yet caught things quickly
enough to get stack dumps.  So I don't think we have particulars yet.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jeremy Schneider (#9)
Re: Slow shutdowns sometimes on RDS Postgres

On 9/14/18 12:11 PM, Jeremy Schneider wrote:

On 9/14/18 10:04, Christophe Pettus wrote:

Thanks Christophe - even if it's not what Chris is running into, this is
is another good call-out.

It's important to distinguish here between the RDS parts and the
community PostgreSQL parts.  I think for this thread it's just worth
pointing out that RDS automation/tooling will report the database in a
"modifying" state until it completes its management operations, however
the actual database unavailability is much shorter.  RDS carefully
engineers their processes to minimize the actual database unavailability
itself.

Chris has run into a problem where the PostgreSQL processes did not shut
down, evidenced by the error messages he mentioned, and as a result his
database was actually unavailable to applications for an extended
period.  This is uncommon and concerning.

This isn't the right forum for discussing the RDS bits; lets take that
to the AWS forums.  It's not synchronous, but the time to complete
should absolutely be predictable within reasonable bounds depending on
the operation type. I don't know how anyone could use the platform
otherwise!  If anyone is unable to establish bounded expectations for
some automated operation, I'd strongly encourage starting a thread on
the AWS forums or opening a support ticket.

Not sure I follow. RDS using Postgres is a set, so I am not sure you can
unbundle then when tracking down a bug.

On 9/14/18 09:27, Adrian Klaver wrote:

The thing is I do not remember any posts to this list mentioning the
same problem on a platform outside RDS. A quick search seems to
confirm that.

I've met folks from other large fleet operators at PG conferences. There
are all kinds of stories we don't find on the lists yet.  :) Hopefully
we're all getting better about closing the loop and sharing stuff back -
that's part of the value large fleet operators can and should bring to
the community.

Hopefully sooner rather then later.

-Jeremy

--
Adrian Klaver
adrian.klaver@aklaver.com

#11Chris Williams
cswilliams@gmail.com
In reply to: Jeremy Schneider (#6)
Re: Slow shutdowns sometimes on RDS Postgres

Hey Jeremy,

Thanks so much for your response. That's a great tip regarding enabling
enhanced monitoring. Maybe it will give some insight into which particular
backends are causing the shutdown to hang. One interesting data point when
this happened is that in cloudwatch, once the database started shutting
down, you can see the connection count drop from 600 to 4 immediately.
Then the graph shows a constant 4 connections for 10 minutes straight
before it finally shuts down and goes to 0. I'm guessing the shutdown is
hanging because of one of these 4 connections. Unfortunately, because the
database is shutting down, I'm not able to connect and get any info about
these connections, so enabling enhanced monitoring might give some more
clues.

My other question is, do you think shutting down my apps manually once I
noticed the shutdown was hanging had any effect on the total shutdown time?
It seems a bit coincidental that the database finally shut down after
exactly 10 minutes. This makes me think that some internal timeout in
RDS's shutdown script was triggered and that shutting down my apps didn't
actually affect anything. I'd much rather just wait 10 minutes then
frantically try to shutdown all of my apps. So I'd be curious to know if
you are able to look up what the timeout is and if you think the timeout is
what is actually causing it to finally shut down. While I'd much rather
have my database shutdown in a minute or two, at least there would be some
comfort in knowing that 10 minutes is the upper bound.

I'd love to be able to figure out how to reproduce it, but it doesn't
happen consistently unfortunately and I've only ever seen it on our main
production database so that makes things trickier. We do need to resize
our database again in a few weeks. If there's any special debug settings i
should enable before modifying it (besides enhanced monitoring) or if
support wanted to observe it in the middle of the shutdown, let me know.
The last two times I've resized the database, the shutdown has hung like
this, so there's a good chance it will do it again.

Another idea I had too was to remove all the security group rules on the db
right before applying the modification. My thinking was that maybe that
might help postgres terminate all the connections quicker? That said, it's
all a shot in the dark I guess since we don't know the underlying cause.

Best,
Chris

On Fri, Sep 14, 2018 at 8:43 AM Jeremy Schneider <schnjere@amazon.com>
wrote:

Show quoted text

Hi Chris - this is an interesting one that we do see from time to time;
seems worth responding here as actually our best understanding right now
is that this is something in community code, not AWS-specific.

On 9/13/18 15:17, Joshua D. Drake wrote:

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

Hey JD! FWIW, the code differences are pretty minimal and generally just
what's required to have a managed service where people can still use the
database as they normally would. The biggest difference is just getting
used to operating without direct OS access, and working through
automation/tooling instead. (And there's always EC2 for the many
customers who want/need superuser but still don't want to maintain the
hardware.)

On 9/13/18 16:10, Adrian Klaver wrote:

The thing is, what you are doing ("(e.g. reboot, changing instance
size, etc.)") are instance operations not database operations. That
comes under AWS's purview.

Correct, managing reboots and hardware reconfigurations would be the
responsibility of AWS. However Chris' issue here is just that PostgreSQL
itself took a long time to shut down. I'm not aware of anything
RDS-specific with this.

On 09/13/2018 03:04 PM, Chris Williams wrote:

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database modifications
(e.g. reboot, changing instance size, etc.). Other times, it shuts
down quickly (1 minute or so). I have not been able to figure out why
sometimes it takes a long time to shutdown.

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

Once I start seeing these messages, I start manually shutting down all
of our applications that are connected to the db. I'm not sure if
shutting down the apps fixes it or if there's some timeout on the RDS
side, but it seems like once I start doing this, the database finally
shuts down.

FYI, yes there is a timeout on the RDS side. The basic workflow is to
try to shutdown postgres the normal way, and if it hasn't cleanly shut
down after a period of time then forcefully kill it.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like. I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

Yeah, that's a really frustrating answer! I can think of a few
additional things you can do. Obviously the holy grail here is a repro
(as usual) -- I bet if you have a reliable repro then RDS support will
be interested in taking a look at that. In fact if we can repro on RDS
then we should be able to repro on community PostgreSQL just as easily,
and could stack trace the non-responsive backend to see why it's stuck.

Outside of a repro, you can enable enhanced monitoring on RDS which
gives you a "top-like" process listing where you can at least see the
basic state of the stuck backend, much like you would with top/ps on
community postgres. Once you upgrade to PostgreSQL 10 on RDS you can
also enable Performance Insights which will show you the SQL, if the
backend happens to be active. I haven't gone real deep on pgSentinel
yet, but it provides similar functionality on community PostgreSQL.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries and
the db is under very light load, so I don't understand why it takes so
long to shutdown. We do have a sizeable number of connections though
(about 600) and there are two replicas connected to it. I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

Good data point - the fact that you set the idle timeout so low
reinforces my suspicion that there was a non-responsive backend involved
here.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

So yeah, it's not common... but you're not alone. If you're able to
narrow it down then please share anything you learn!

Hope this helps,
Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

#12Jeremy Schneider
schnjere@amazon.com
In reply to: Chris Williams (#11)
Re: Slow shutdowns sometimes on RDS Postgres

In this particular case, if the signals from postmaster didn't stop those four processes, then it seems unlikely that anything else would make a difference - whether stopping clients or changing network rules. Based on what you describe, the internal timeout in the shutdown script seems the most probable explanation and you have likely found an upper bound.

-Jeremy

Sent from my TI-83

On Sep 14, 2018, at 4:08 PM, Chris Williams <cswilliams@gmail.com<mailto:cswilliams@gmail.com>> wrote:

Hey Jeremy,

Thanks so much for your response. That's a great tip regarding enabling enhanced monitoring. Maybe it will give some insight into which particular backends are causing the shutdown to hang. One interesting data point when this happened is that in cloudwatch, once the database started shutting down, you can see the connection count drop from 600 to 4 immediately. Then the graph shows a constant 4 connections for 10 minutes straight before it finally shuts down and goes to 0. I'm guessing the shutdown is hanging because of one of these 4 connections. Unfortunately, because the database is shutting down, I'm not able to connect and get any info about these connections, so enabling enhanced monitoring might give some more clues.

My other question is, do you think shutting down my apps manually once I noticed the shutdown was hanging had any effect on the total shutdown time? It seems a bit coincidental that the database finally shut down after exactly 10 minutes. This makes me think that some internal timeout in RDS's shutdown script was triggered and that shutting down my apps didn't actually affect anything. I'd much rather just wait 10 minutes then frantically try to shutdown all of my apps. So I'd be curious to know if you are able to look up what the timeout is and if you think the timeout is what is actually causing it to finally shut down. While I'd much rather have my database shutdown in a minute or two, at least there would be some comfort in knowing that 10 minutes is the upper bound.

I'd love to be able to figure out how to reproduce it, but it doesn't happen consistently unfortunately and I've only ever seen it on our main production database so that makes things trickier. We do need to resize our database again in a few weeks. If there's any special debug settings i should enable before modifying it (besides enhanced monitoring) or if support wanted to observe it in the middle of the shutdown, let me know. The last two times I've resized the database, the shutdown has hung like this, so there's a good chance it will do it again.

Another idea I had too was to remove all the security group rules on the db right before applying the modification. My thinking was that maybe that might help postgres terminate all the connections quicker? That said, it's all a shot in the dark I guess since we don't know the underlying cause.

Best,
Chris

On Fri, Sep 14, 2018 at 8:43 AM Jeremy Schneider <schnjere@amazon.com<mailto:schnjere@amazon.com>> wrote:
Hi Chris - this is an interesting one that we do see from time to time;
seems worth responding here as actually our best understanding right now
is that this is something in community code, not AWS-specific.

On 9/13/18 15:17, Joshua D. Drake wrote:

This is probably something you would have to talk to Amazon about. AWS
RDS Postgres is a fork of PostgreSQL and not 100% compatible from an
administrative perspective.

Hey JD! FWIW, the code differences are pretty minimal and generally just
what's required to have a managed service where people can still use the
database as they normally would. The biggest difference is just getting
used to operating without direct OS access, and working through
automation/tooling instead. (And there's always EC2 for the many
customers who want/need superuser but still don't want to maintain the
hardware.)

On 9/13/18 16:10, Adrian Klaver wrote:

The thing is, what you are doing ("(e.g. reboot, changing instance
size, etc.)") are instance operations not database operations. That
comes under AWS's purview.

Correct, managing reboots and hardware reconfigurations would be the
responsibility of AWS. However Chris' issue here is just that PostgreSQL
itself took a long time to shut down. I'm not aware of anything
RDS-specific with this.

On 09/13/2018 03:04 PM, Chris Williams wrote:

I'm using AWS RDS Postgres (9.6.6) and have run into very slow
shutdowns (10+ minutes) a few times when making database modifications
(e.g. reboot, changing instance size, etc.). Other times, it shuts
down quickly (1 minute or so). I have not been able to figure out why
sometimes it takes a long time to shutdown.

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

Once I start seeing these messages, I start manually shutting down all
of our applications that are connected to the db. I'm not sure if
shutting down the apps fixes it or if there's some timeout on the RDS
side, but it seems like once I start doing this, the database finally
shuts down.

FYI, yes there is a timeout on the RDS side. The basic workflow is to
try to shutdown postgres the normal way, and if it hasn't cleanly shut
down after a period of time then forcefully kill it.

When it takes this long to shut down, it ends up causing a lot more
downtime than I would like. I've tried asking AWS's support why it
takes so long to shutdown sometimes, but they basically just told me
that's "how it works" and that I should try to shut down all of my
connections ahead of time before making database modifications.

Yeah, that's a really frustrating answer! I can think of a few
additional things you can do. Obviously the holy grail here is a repro
(as usual) -- I bet if you have a reliable repro then RDS support will
be interested in taking a look at that. In fact if we can repro on RDS
then we should be able to repro on community PostgreSQL just as easily,
and could stack trace the non-responsive backend to see why it's stuck.

Outside of a repro, you can enable enhanced monitoring on RDS which
gives you a "top-like" process listing where you can at least see the
basic state of the stuck backend, much like you would with top/ps on
community postgres. Once you upgrade to PostgreSQL 10 on RDS you can
also enable Performance Insights which will show you the SQL, if the
backend happens to be active. I haven't gone real deep on pgSentinel
yet, but it provides similar functionality on community PostgreSQL.

We just have a few ruby on rails applications connected to the
database, and don't really have any long running or heavy queries and
the db is under very light load, so I don't understand why it takes so
long to shutdown. We do have a sizeable number of connections though
(about 600) and there are two replicas connected to it. I also tried
setting idle_in_transaction_session_timeout to 300 seconds to see if
that would help, but it made no difference.

Good data point - the fact that you set the idle timeout so low
reinforces my suspicion that there was a non-responsive backend involved
here.

I was wondering if anyone else had seen this behavior on their RDS
Postgres instances or had any suggestions on how I could shorten the
shutdown time?

So yeah, it's not common... but you're not alone. If you're able to
narrow it down then please share anything you learn!

Hope this helps,
Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services