Debugging a backend stuck consuming CPU

Started by Kenneth Marshallalmost 10 years ago8 messagesgeneral
Jump to latest
#1Kenneth Marshall
ktm@rice.edu

Hi PostgreSQL community,

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2. Are there any bugs that could be
causing this behavior? How can I tell what the process is actually
doing? It does not respond to either pg_cancel_backend() or to
pg_terminate_backend() so it is acting like it is not receiving the
signals. Any ideas on how to debug this would be appriciated.

Regards,
Ken

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kenneth Marshall (#1)
Re: Debugging a backend stuck consuming CPU

"ktm@rice.edu" <ktm@rice.edu> writes:

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2.

You do realize that the current release in that series is 9.2.17.

Are there any bugs that could be causing this behavior?

Known bugs are summarized here:
http://www.postgresql.org/docs/9.2/static/release.html

How can I tell what the process is actually doing?

Getting a stack trace with gdb might be informative:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Kenneth Marshall (#1)
Re: Debugging a backend stuck consuming CPU

On 05/19/2016 06:46 AM, ktm@rice.edu wrote:

Hi PostgreSQL community,

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2. Are there any bugs that could be
causing this behavior? How can I tell what the process is actually
doing? It does not respond to either pg_cancel_backend() or to
pg_terminate_backend() so it is acting like it is not receiving the
signals. Any ideas on how to debug this would be appriciated.

What does the Postgres log show around the time of the command and/or
currently?

What OS and does it's system log show anything?

Can you query pg_stat_activity?

http://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW

Regards,
Ken

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Kenneth Marshall
ktm@rice.edu
In reply to: Tom Lane (#2)
Re: Debugging a backend stuck consuming CPU

On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote:

"ktm@rice.edu" <ktm@rice.edu> writes:

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2.

You do realize that the current release in that series is 9.2.17.

Are there any bugs that could be causing this behavior?

Known bugs are summarized here:
http://www.postgresql.org/docs/9.2/static/release.html

How can I tell what the process is actually doing?

Getting a stack trace with gdb might be informative:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

regards, tom lane

Hi,

We do have an upgrade scheduled. I will get a stack trace and see what
information that provides.

Regards,
Ken

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Merlin Moncure
mmoncure@gmail.com
In reply to: Kenneth Marshall (#1)
Re: Debugging a backend stuck consuming CPU

On Thu, May 19, 2016 at 8:46 AM, ktm@rice.edu <ktm@rice.edu> wrote:

Hi PostgreSQL community,

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2. Are there any bugs that could be
causing this behavior? How can I tell what the process is actually
doing? It does not respond to either pg_cancel_backend() or to
pg_terminate_backend() so it is acting like it is not receiving the
signals. Any ideas on how to debug this would be appriciated.

1. Grab a stack trace and a 'perf top' for forensic analysis in case
it happens again
2. Install latest 9.2 binaries
3. Restart database

There's not much value in analyzing such an out of date database;
there's a very high probability your bug has already been fixed.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Kenneth Marshall
ktm@rice.edu
In reply to: Tom Lane (#2)
Re: Debugging a backend stuck consuming CPU

On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote:

"ktm@rice.edu" <ktm@rice.edu> writes:

I am investigating a problem with a backend that appears to be stuck
and spinning while performing a "DISCARD ALL" command. The system is
running an older release 9.2.2.

You do realize that the current release in that series is 9.2.17.

Are there any bugs that could be causing this behavior?

Known bugs are summarized here:
http://www.postgresql.org/docs/9.2/static/release.html

How can I tell what the process is actually doing?

Getting a stack trace with gdb might be informative:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

regards, tom lane

Hi,

The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
command was being run:

Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000054e3a9 in DropAllPreparedStatements ()
#3 0x00000000005365f3 in DiscardCommand ()
#4 0x00000000006582c7 in ?? ()
#5 0x00000000006592bd in ?? ()
#6 0x0000000000659a42 in PortalRun ()
#7 0x000000000065603d in ?? ()
#8 0x0000000000656ed0 in PostgresMain ()
#9 0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000054e3a9 in DropAllPreparedStatements ()
#3 0x00000000005365f3 in DiscardCommand ()
#4 0x00000000006582c7 in ?? ()
#5 0x00000000006592bd in ?? ()
#6 0x0000000000659a42 in PortalRun ()
#7 0x000000000065603d in ?? ()
#8 0x0000000000656ed0 in PostgresMain ()
#9 0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000070e7df in DropCachedPlan ()
#3 0x000000000054e3a9 in DropAllPreparedStatements ()
#4 0x00000000005365f3 in DiscardCommand ()
#5 0x00000000006582c7 in ?? ()
#6 0x00000000006592bd in ?? ()
#7 0x0000000000659a42 in PortalRun ()
#8 0x000000000065603d in ?? ()
#9 0x0000000000656ed0 in PostgresMain ()
#10 0x0000000000613b91 in ?? ()
#11 0x00000000006167fc in PostmasterMain ()
#12 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000054e3a9 in DropAllPreparedStatements ()
#3 0x00000000005365f3 in DiscardCommand ()
#4 0x00000000006582c7 in ?? ()
#5 0x00000000006592bd in ?? ()
#6 0x0000000000659a42 in PortalRun ()
#7 0x000000000065603d in ?? ()
#8 0x0000000000656ed0 in PostgresMain ()
#9 0x0000000000613b91 in ?? ()
#10 0x00000000006167fc in PostmasterMain ()
#11 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0 0x000000000070e7ff in DropCachedPlan ()
#1 0x000000000054e3a9 in DropAllPreparedStatements ()
#2 0x00000000005365f3 in DiscardCommand ()
#3 0x00000000006582c7 in ?? ()
#4 0x00000000006592bd in ?? ()
#5 0x0000000000659a42 in PortalRun ()
#6 0x000000000065603d in ?? ()
#7 0x0000000000656ed0 in PostgresMain ()
#8 0x0000000000613b91 in ?? ()
#9 0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Detaching from program: /usr/pgsql-9.2/bin/postgres, process 38604
Undefined command: "exit". Try "help".
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0 0x000000000070e7ff in DropCachedPlan ()
#1 0x000000000054e3a9 in DropAllPreparedStatements ()
#2 0x00000000005365f3 in DiscardCommand ()
#3 0x00000000006582c7 in ?? ()
#4 0x00000000006592bd in ?? ()
#5 0x0000000000659a42 in PortalRun ()
#6 0x000000000065603d in ?? ()
#7 0x0000000000656ed0 in PostgresMain ()
#8 0x0000000000613b91 in ?? ()
#9 0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000070e7ff in DropCachedPlan ()
#0 0x000000000070e7ff in DropCachedPlan ()
#1 0x000000000054e3a9 in DropAllPreparedStatements ()
#2 0x00000000005365f3 in DiscardCommand ()
#3 0x00000000006582c7 in ?? ()
#4 0x00000000006592bd in ?? ()
#5 0x0000000000659a42 in PortalRun ()
#6 0x000000000065603d in ?? ()
#7 0x0000000000656ed0 in PostgresMain ()
#8 0x0000000000613b91 in ?? ()
#9 0x00000000006167fc in PostmasterMain ()
#10 0x00000000005b5290 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x000000000073bc7c in MemoryContextSetParent ()
#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000070e7df in DropCachedPlan ()
#3 0x000000000054e3a9 in DropAllPreparedStatements ()
#4 0x00000000005365f3 in DiscardCommand ()
#5 0x00000000006582c7 in ?? ()
#6 0x00000000006592bd in ?? ()
#7 0x0000000000659a42 in PortalRun ()
#8 0x000000000065603d in ?? ()
#9 0x0000000000656ed0 in PostgresMain ()
#10 0x0000000000613b91 in ?? ()
#11 0x00000000006167fc in PostmasterMain ()
#12 0x00000000005b5290 in main ()
Continuing.

Does a DISCARD command take alot of time, or is it like TRUNCATE? The
backend does have a very large memory footprint (12GB).

Regards,
Ken

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kenneth Marshall (#6)
Re: Debugging a backend stuck consuming CPU

"ktm@rice.edu" <ktm@rice.edu> writes:

The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
command was being run:

#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000054e3a9 in DropAllPreparedStatements ()
#3 0x00000000005365f3 in DiscardCommand ()

Hmm, what it seems from these traces is that you've got a whole heck of
a lot of prepared statements.

The backend does have a very large memory footprint (12GB).

Um.

The most likely explanation is that you are hitting O(N^2) behavior as
a consequence of MemoryContextSetParent being O(N) in the number of
sibling contexts of the context to be deleted. We fixed that for 9.6
(commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy
solution in older branches, short of not using so many prepared
statements. I'm a bit surprised that you could have gotten up to 12GB
worth of prepared statements in an application that sends DISCARD ALL
periodically.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Kenneth Marshall
ktm@rice.edu
In reply to: Tom Lane (#7)
Re: Debugging a backend stuck consuming CPU

On Thu, May 19, 2016 at 05:52:26PM -0400, Tom Lane wrote:

"ktm@rice.edu" <ktm@rice.edu> writes:

The stack trace just appeared to be what I would expect while a 'DISCARD ALL'
command was being run:

#0 0x000000000073bc7c in MemoryContextSetParent ()
#1 0x000000000073bde3 in MemoryContextDelete ()
#2 0x000000000054e3a9 in DropAllPreparedStatements ()
#3 0x00000000005365f3 in DiscardCommand ()

Hmm, what it seems from these traces is that you've got a whole heck of
a lot of prepared statements.

The backend does have a very large memory footprint (12GB).

Um.

The most likely explanation is that you are hitting O(N^2) behavior as
a consequence of MemoryContextSetParent being O(N) in the number of
sibling contexts of the context to be deleted. We fixed that for 9.6
(commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy
solution in older branches, short of not using so many prepared
statements. I'm a bit surprised that you could have gotten up to 12GB
worth of prepared statements in an application that sends DISCARD ALL
periodically.

regards, tom lane

Hi,

The DISCARD ALL is only sent by pgbouncer at the end of the processing.
The actual process builds up a cache to be used later whose size is
proportional to the number of items. The initial run is large, but the
regular runs are much smaller and cleanup quickly. I was more concerned
with incorrect behavior leading to DB corruption. Thank you for your
suggestions and assistance.

Regards,
Ken

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general