Help troubleshooting SubtransControlLock problems

Started by Scott Frazerover 8 years ago10 messagesgeneral

sfrazer@couponcabin.com

over 8 years ago

Hi, we have a Postgres 9.6 setup using replication that has recently
started seeing a lot of processes stuck in "SubtransControlLock" as a
wait_event on the read-replicas. Like this, only usually about 300-800 of
them:

179706 | LWLockNamed | SubtransControlLock

186602 | LWLockNamed | SubtransControlLock

186606 | LWLockNamed | SubtransControlLock

180947 | LWLockNamed | SubtransControlLock

186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing
until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event only
seems to turn up the page about statistics collection, but no helpful
information on troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start
working again, but it's happened twice now in 12 hours and I'm worried it
will happen again.

Does anyone have any advice on where to start looking?

Thanks,
Scott

Rene Romero Benavides

rene.romero.b@gmail.com

over 8 years ago

In reply to: Scott Frazer (#1)

Re: Help troubleshooting SubtransControlLock problems

For such issues, I find this view very useful (the first one):
https://wiki.postgresql.org/wiki/Lock_Monitoring

Examine blocking_pid's , and tell us what kind of operation is blocking
the other processes . Also, are there many long running transactions in
your server?

2018-03-06 21:24 GMT-06:00 Scott Frazer <sfrazer@couponcabin.com>:

Hi, we have a Postgres 9.6 setup using replication that has recently
started seeing a lot of processes stuck in "SubtransControlLock" as a
wait_event on the read-replicas. Like this, only usually about 300-800 of
them:

179706 | LWLockNamed | SubtransControlLock

186602 | LWLockNamed | SubtransControlLock

186606 | LWLockNamed | SubtransControlLock

180947 | LWLockNamed | SubtransControlLock

186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing
until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event only
seems to turn up the page about statistics collection, but no helpful
information on troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start
working again, but it's happened twice now in 12 hours and I'm worried it
will happen again.

Does anyone have any advice on where to start looking?

Thanks,
Scott

--
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/

Scott Frazer

sfrazer@couponcabin.com

over 8 years ago

In reply to: Rene Romero Benavides (#2)

Re: Help troubleshooting SubtransControlLock problems

These don't seem like normal locks. Nothing shows up in a "SELECT
relation::regclass, * FROM pg_locks WHERE NOT GRANTED;"

These processes are all active but the wait_event and wait_event_type
fields indicate they are waiting on (I believe) shared memory locks.

vent_type | wait_event | state

--------+----------+--------------+---------------------------------------------+-------------------------------+-------

----------+---------------------+--------

165024 | 16392 | content_user | application | nymapp01.nym | 6192 |
2018-03-07 09:19:09.770155-06 | LWLock

Named | SubtransControlLock | active

On Tue, Mar 6, 2018 at 11:43 PM, Rene Romero Benavides <
rene.romero.b@gmail.com> wrote:

Show quoted text

For such issues, I find this view very useful (the first one):
https://wiki.postgresql.org/wiki/Lock_Monitoring

Examine blocking_pid's , and tell us what kind of operation is blocking
the other processes . Also, are there many long running transactions in
your server?

2018-03-06 21:24 GMT-06:00 Scott Frazer <sfrazer@couponcabin.com>:

Hi, we have a Postgres 9.6 setup using replication that has recently
started seeing a lot of processes stuck in "SubtransControlLock" as a
wait_event on the read-replicas. Like this, only usually about 300-800 of
them:

179706 | LWLockNamed | SubtransControlLock

186602 | LWLockNamed | SubtransControlLock

186606 | LWLockNamed | SubtransControlLock

180947 | LWLockNamed | SubtransControlLock

186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing
until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event
only seems to turn up the page about statistics collection, but no helpful
information on troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start
working again, but it's happened twice now in 12 hours and I'm worried it
will happen again.

Does anyone have any advice on where to start looking?

Thanks,
Scott

--
El genio es 1% inspiración y 99% transpiración.
Thomas Alva Edison
http://pglearn.blogspot.mx/

Laurenz Albe

laurenz.albe@cybertec.at

over 8 years ago

In reply to: Scott Frazer (#1)

Re: Help troubleshooting SubtransControlLock problems

Scott Frazer wrote:

Hi, we have a Postgres 9.6 setup using replication that has recently started seeing a lot of processes stuck in
"SubtransControlLock" as a wait_event on the read-replicas. Like this, only usually about 300-800 of them:

179706 | LWLockNamed | SubtransControlLock
186602 | LWLockNamed | SubtransControlLock
186606 | LWLockNamed | SubtransControlLock
180947 | LWLockNamed | SubtransControlLock
186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event only seems to turn up the page
about statistics collection, but no helpful information on troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start working again, but it's happened
twice now in 12 hours and I'm worried it will happen again.

Does anyone have any advice on where to start looking?

I don't think there is any connection to statistics collection.

This lock is used when subtransactions (SAVEPOINTs in SQL or EXCEPTION blocks
in PL/pgSQL) are created, read or removed.

This sounds like a PostgreSQL bug.

What is the exact PostgreSQL version you are running?

It would be cool if you could get a stack trace from the backend that is holding the lock.

Yours,
Laurenz Albe

Scott Frazer

sfrazer@couponcabin.com

over 8 years ago

In reply to: Laurenz Albe (#4)

Re: Help troubleshooting SubtransControlLock problems

Server version is 9.6.5

Is there a decent guide to getting a stack trace on Centos7 when using the
official Postgres repo? trying to follow the Fedora guide put the debug
info for 9.2.23 on the box instead of the 9.6.5 version.

On Wed, Mar 7, 2018 at 9:52 AM, Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

Show quoted text

Scott Frazer wrote:

Hi, we have a Postgres 9.6 setup using replication that has recently

started seeing a lot of processes stuck in

"SubtransControlLock" as a wait_event on the read-replicas. Like this,

only usually about 300-800 of them:

179706 | LWLockNamed | SubtransControlLock
186602 | LWLockNamed | SubtransControlLock
186606 | LWLockNamed | SubtransControlLock
180947 | LWLockNamed | SubtransControlLock
186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing

until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event

only seems to turn up the page

about statistics collection, but no helpful information on

troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start

working again, but it's happened

twice now in 12 hours and I'm worried it will happen again.

Does anyone have any advice on where to start looking?

I don't think there is any connection to statistics collection.

This lock is used when subtransactions (SAVEPOINTs in SQL or EXCEPTION
blocks
in PL/pgSQL) are created, read or removed.

This sounds like a PostgreSQL bug.

What is the exact PostgreSQL version you are running?

It would be cool if you could get a stack trace from the backend that is
holding the lock.

Yours,
Laurenz Albe

Laurenz Albe

laurenz.albe@cybertec.at

over 8 years ago

In reply to: Scott Frazer (#5)

Re: Help troubleshooting SubtransControlLock problems

Scott Frazer wrote:

On Wed, Mar 7, 2018 at 9:52 AM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

Scott Frazer wrote:

Hi, we have a Postgres 9.6 setup using replication that has recently started seeing a lot of processes stuck in
"SubtransControlLock" as a wait_event on the read-replicas. Like this, only usually about 300-800 of them:

179706 | LWLockNamed | SubtransControlLock
186602 | LWLockNamed | SubtransControlLock
186606 | LWLockNamed | SubtransControlLock
180947 | LWLockNamed | SubtransControlLock
186621 | LWLockNamed | SubtransControlLock

The server then begins to crawl, with some queries just never finishing until I finally shut the server down.

Searching for that particular combo of wait_event_type and wait_event only seems to turn up the page
about statistics collection, but no helpful information on troubleshooting this lock.

Restarting the replica server clears the locks and allows us to start working again, but it's happened
twice now in 12 hours and I'm worried it will happen again.

Does anyone have any advice on where to start looking?

I don't think there is any connection to statistics collection.

This lock is used when subtransactions (SAVEPOINTs in SQL or EXCEPTION blocks
in PL/pgSQL) are created, read or removed.

This sounds like a PostgreSQL bug.

What is the exact PostgreSQL version you are running?

It would be cool if you could get a stack trace from the backend that is holding the lock.

Server version is 9.6.5

Is there a decent guide to getting a stack trace on Centos7 when using the official Postgres repo?
trying to follow the Fedora guide put the debug info for 9.2.23 on the box instead of the 9.6.5 version.

You'll have to install debuginfo from the same source where you got PostgreSQL.
It has to be for the same database version.

You should upgrade to 9.6.8 and see if the problem persists.

There is a guid how to get a stack trace at
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Laurenz Albe (#4)

Re: Help troubleshooting SubtransControlLock problems

Laurenz Albe wrote:

Scott Frazer wrote:

Hi, we have a Postgres 9.6 setup using replication that has recently started seeing a lot of processes stuck in
"SubtransControlLock" as a wait_event on the read-replicas. Like this, only usually about 300-800 of them:

179706 | LWLockNamed | SubtransControlLock

I think you could get in this situation if the range of open
transactions exceeds what fits in the buffers for subtrans.c pages, and
the subtransaction cache overflows (64 entries apiece;
PGPROC_MAX_CACHED_SUBXIDS in src/include/storage/proc.h). Each page is
2048 transactions (4 bytes per xact, 8192 bytes per page;
SUBTRANS_XACTS_PER_PAGE in src/backend/access/transam/subtrans.c), and
we keep 32 pages (NUM_SUBTRANS_BUFFERS src/include/access/subtrans.h).
So if your oldest transaction is over 64k XIDs old, and you have
transactions with more than 64 subtransactions, you get in trouble.

A simple solution is to increase NUM_SUBTRANS_BUFFERS and recompile.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Scott Frazer

sfrazer@couponcabin.com

over 8 years ago

In reply to: Alvaro Herrera (#7)

Re: Help troubleshooting SubtransControlLock problems

On Wed, Mar 7, 2018 at 10:39 AM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Laurenz Albe wrote:
I think you could get in this situation if the range of open
transactions exceeds what fits in the buffers for subtrans.c pages, and
the subtransaction cache overflows (64 entries apiece;
PGPROC_MAX_CACHED_SUBXIDS in src/include/storage/proc.h). Each page is
2048 transactions (4 bytes per xact, 8192 bytes per page;
SUBTRANS_XACTS_PER_PAGE in src/backend/access/transam/subtrans.c), and
we keep 32 pages (NUM_SUBTRANS_BUFFERS src/include/access/subtrans.h).
So if your oldest transaction is over 64k XIDs old, and you have
transactions with more than 64 subtransactions, you get in trouble.

A simple solution is to increase NUM_SUBTRANS_BUFFERS and recompile.

Currently this seems to be happening when the server just starts taking
connections. It doesn't even need time to scale up anymore (previously it
took about 8 hours for the problem to re-appear)

It's only happening on the read replicas, though. I've just set my master
to handle all the traffic, but that's not really sustainable

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Scott Frazer (#8)

Re: Help troubleshooting SubtransControlLock problems

Scott Frazer wrote:

Currently this seems to be happening when the server just starts taking
connections. It doesn't even need time to scale up anymore (previously it
took about 8 hours for the problem to re-appear)

It's only happening on the read replicas, though. I've just set my master
to handle all the traffic, but that's not really sustainable

I failed to notice at start of thread that these were replicas. I
suppose then you would have to be thinking about the range of XIDs in
the primary server. Maybe you left some old transaction open, or an
uncommitted prepared transaction there?

If not, I suppose you're up for some "perf" profiling ...

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10

Scott Frazer

sfrazer@couponcabin.com

over 8 years ago

In reply to: Alvaro Herrera (#9)

Re: Help troubleshooting SubtransControlLock problems

On Wed, Mar 7, 2018 at 3:50 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Scott Frazer wrote:

It's only happening on the read replicas, though. I've just set my

master

to handle all the traffic, but that's not really sustainable

I failed to notice at start of thread that these were replicas. I
suppose then you would have to be thinking about the range of XIDs in
the primary server. Maybe you left some old transaction open, or an
uncommitted prepared transaction there?

If not, I suppose you're up for some "perf" profiling ...

I think you got it. I found a 2-day old transaction on the master server
that never closed. I had to pg_terminate_backend on it and the problem
seems to have cleared up.