startup process stuck in recovery

Started by Christophe Pettusover 8 years ago26 messagesgeneral
Jump to latest
#1Christophe Pettus
xof@thebuild.com

Hi,

We're dealing with a 9.5.5 database with the symptom that, after a certain amount of time after restart, the startup process reaches a certain WAL segment, and stops. The startup process runs at 100% CPU, with no output from strace. There are no queries running on the secondary, so it's not a delay for queries. The WAL receiver continues to operate normally, and we can see sent_location / write_location / flush_location continue to move ahead in parallel, with replay_location stuck in that WAL segment.

Suggestions on further diagnosis?

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#1)
Re: startup process stuck in recovery

Christophe Pettus <xof@thebuild.com> writes:

We're dealing with a 9.5.5 database with the symptom that, after a certain amount of time after restart, the startup process reaches a certain WAL segment, and stops. The startup process runs at 100% CPU, with no output from strace. There are no queries running on the secondary, so it's not a delay for queries. The WAL receiver continues to operate normally, and we can see sent_location / write_location / flush_location continue to move ahead in parallel, with replay_location stuck in that WAL segment.

Suggestions on further diagnosis?

Attach to startup process with gdb, and get a stack trace?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#2)
Re: startup process stuck in recovery

On Oct 9, 2017, at 12:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Attach to startup process with gdb, and get a stack trace?

#0 0x0000558812f4f1da in ?? ()
#1 0x0000558812f4f8cb in StandbyReleaseLockTree ()
#2 0x0000558812d718ee in ?? ()
#3 0x0000558812d75520 in xact_redo ()
#4 0x0000558812d7f713 in StartupXLOG ()
#5 0x0000558812f0e262 in StartupProcessMain ()
#6 0x0000558812d8d4ea in AuxiliaryProcessMain ()
#7 0x0000558812f0b2e9 in ?? ()
#8 0x0000558812f0dae7 in PostmasterMain ()
#9 0x0000558812d0c402 in main ()

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Christophe Pettus
xof@thebuild.com
In reply to: Christophe Pettus (#3)
Re: startup process stuck in recovery

On Oct 9, 2017, at 12:18, Christophe Pettus <xof@thebuild.com> wrote:

#0 0x0000558812f4f1da in ?? ()
#1 0x0000558812f4f8cb in StandbyReleaseLockTree ()
#2 0x0000558812d718ee in ?? ()
#3 0x0000558812d75520 in xact_redo ()
#4 0x0000558812d7f713 in StartupXLOG ()
#5 0x0000558812f0e262 in StartupProcessMain ()
#6 0x0000558812d8d4ea in AuxiliaryProcessMain ()
#7 0x0000558812f0b2e9 in ?? ()
#8 0x0000558812f0dae7 in PostmasterMain ()
#9 0x0000558812d0c402 in main ()

The other observation is that the startup process is holding a *lot* of locks:

(pid 5882 is the startup process)

# select mode, count(*) from pg_locks where pid=5882 group by mode;
mode | count
---------------------+-------
ExclusiveLock | 1
AccessExclusiveLock | 7133
(2 rows)

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#4)
Re: startup process stuck in recovery

Christophe Pettus <xof@thebuild.com> writes:

The other observation is that the startup process is holding a *lot* of locks:

Hmm. Is it possible that the process is replaying the abort of a
transaction with a lot of subtransactions? It seems like maybe
you could be getting into an O(N^2) situation, though for N in
the low thousands it's hard to see that loop taking so long that
you'd think things were stuck.

# select mode, count(*) from pg_locks where pid=5882 group by mode;
AccessExclusiveLock | 7133

Is that number changing at all?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#5)
Re: startup process stuck in recovery

On Oct 9, 2017, at 13:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm. Is it possible that the process is replaying the abort of a
transaction with a lot of subtransactions?

That's possible, although we're now talking about an hours-long delay at this point.

Is that number changing at all?

Increasing:

# select mode, count(*) from pg_locks where pid=5882 group by mode;
mode | count
---------------------+-------
ExclusiveLock | 1
AccessExclusiveLock | 8810
(2 rows)

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#6)
Re: startup process stuck in recovery

Christophe Pettus <xof@thebuild.com> writes:

On Oct 9, 2017, at 13:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Is that number changing at all?

Increasing:
AccessExclusiveLock | 8810

Oh, that's really interesting. So it's not *just* releasing locks but
also acquiring them, which says that it is making progress of some sort.

My bet is that the source server did something that's provoking O(N^2)
behavior in the standby server's lock management. It's hard to say
exactly what, but I'm wondering about something like a plpgsql function
taking an AccessExclusiveLock inside a loop that repeatedly traps an
exception. Can you correlate where the standby is stuck with what
was happening on the source?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#7)
Re: startup process stuck in recovery

On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh, that's really interesting. So it's not *just* releasing locks but
also acquiring them, which says that it is making progress of some sort.

It seems to have leveled out now, and is still grinding away.

Can you correlate where the standby is stuck with what
was happening on the source?

There are definitely some mysterious functions being called, and I'll dig into those.

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#7)
Re: startup process stuck in recovery

On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My bet is that the source server did something that's provoking O(N^2)
behavior in the standby server's lock management. It's hard to say
exactly what, but I'm wondering about something like a plpgsql function
taking an AccessExclusiveLock inside a loop that repeatedly traps an
exception. Can you correlate where the standby is stuck with what
was happening on the source?

Interestingly, the OIDs for the relations on which the locks on the secondary are held aren't present in pg_class, and they're clustered together. Could a large number of temporary table creations that are being undone by an abort cause this?

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#9)
Re: startup process stuck in recovery

Christophe Pettus <xof@thebuild.com> writes:

On Oct 9, 2017, at 13:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My bet is that the source server did something that's provoking O(N^2)
behavior in the standby server's lock management. It's hard to say
exactly what, but I'm wondering about something like a plpgsql function
taking an AccessExclusiveLock inside a loop that repeatedly traps an
exception. Can you correlate where the standby is stuck with what
was happening on the source?

Interestingly, the OIDs for the relations on which the locks on the secondary are held aren't present in pg_class, and they're clustered together. Could a large number of temporary table creations that are being undone by an abort cause this?

Hmm. Creating or dropping a temp table does take AccessExclusiveLock,
just as it does for a non-temp table. In principle we'd not have to
transmit those locks to standbys, but I doubt that the WAL code has
enough knowledge to filter them out. So a lot of temp tables and
a lot of separate subtransactions could be a nasty combination.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#10)
Re: startup process stuck in recovery

On Oct 9, 2017, at 14:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm. Creating or dropping a temp table does take AccessExclusiveLock,
just as it does for a non-temp table. In principle we'd not have to
transmit those locks to standbys, but I doubt that the WAL code has
enough knowledge to filter them out. So a lot of temp tables and
a lot of separate subtransactions could be a nasty combination.

The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a single transaction. It's made worse by one of those transactions finally getting replayed on the secondary, only to have another one come in right behind it...
--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Merlin Moncure
mmoncure@gmail.com
In reply to: Christophe Pettus (#11)
Re: startup process stuck in recovery

On Mon, Oct 9, 2017 at 6:12 PM, Christophe Pettus <xof@thebuild.com> wrote:

On Oct 9, 2017, at 14:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm. Creating or dropping a temp table does take AccessExclusiveLock,
just as it does for a non-temp table. In principle we'd not have to
transmit those locks to standbys, but I doubt that the WAL code has
enough knowledge to filter them out. So a lot of temp tables and
a lot of separate subtransactions could be a nasty combination.

The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a single transaction. It's made worse by one of those transactions finally getting replayed on the secondary, only to have another one come in right behind it...

FYI, this is pathological coding; it can cause other nastiness such as
exhausting shared memory. Consider refactoring the code to using a
permanent table (if access is concurrent, you can do tricks like using
'pid' in the key to subdivide the data).

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#11)
Re: startup process stuck in recovery

Christophe Pettus <xof@thebuild.com> writes:

The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a single transaction. It's made worse by one of those transactions finally getting replayed on the secondary, only to have another one come in right behind it...

Hmm, I tried to reproduce this and could not. I experimented with
various permutations of this:

create or replace function goslow(n int) returns void language plpgsql as
$$
begin
for i in 1..n loop
begin
execute 'create temp table myfoo' || i || '(f1 int)';
perform i/0;
exception when division_by_zero then null;
end;
end loop;
perform n/0;
end;
$$;

and did not see any untoward behavior, at least not till I got to enough
temp tables to overrun the master's shared lock table, and even then it
cleaned up fine. At no point was the standby process consuming anywhere
near as much CPU as the master's backend.

What am I missing to reproduce the problem?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

In reply to: Tom Lane (#13)
Re: startup process stuck in recovery

On Mon, Oct 9, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

and did not see any untoward behavior, at least not till I got to enough
temp tables to overrun the master's shared lock table, and even then it
cleaned up fine. At no point was the standby process consuming anywhere
near as much CPU as the master's backend.

What am I missing to reproduce the problem?

Just a guess, but do you disable autovacuum on your dev machine? (I know I do.)

It's possible that this is relevant:
/messages/by-id/CAB-EU3RawZx8-OzMfvswFf6z+Y7GOZf03TZ=bez+PbQX+A4M7Q@mail.gmail.com

--
Peter Geoghegan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#14)
Re: startup process stuck in recovery

Peter Geoghegan <pg@bowt.ie> writes:

Just a guess, but do you disable autovacuum on your dev machine? (I know I do.)

Nope.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

In reply to: Christophe Pettus (#1)
Re: startup process stuck in recovery

On Mon, Oct 9, 2017 at 12:08 PM, Christophe Pettus <xof@thebuild.com> wrote:

Suggestions on further diagnosis?

What's the hot_standy_feedback setting? How about
max_standby_archive_delay/max_standby_streaming_delay?

--
Peter Geoghegan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Christophe Pettus
xof@thebuild.com
In reply to: Peter Geoghegan (#16)
Re: startup process stuck in recovery

On Oct 9, 2017, at 18:21, Peter Geoghegan <pg@bowt.ie> wrote:
What's the hot_standy_feedback setting? How about
max_standby_archive_delay/max_standby_streaming_delay?

On, 5m, 5m.

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#18Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#13)
Re: startup process stuck in recovery

On Oct 9, 2017, at 17:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What am I missing to reproduce the problem?

Not sure. The actual client behavior here is a bit cryptic (not our code, incompletely logs). They might be creating a savepoint before each temp table creation, without a release, but I would expect that to exhaust something else first.

--
-- Christophe Pettus
xof@thebuild.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#13)
Re: startup process stuck in recovery

Tom Lane wrote:

Christophe Pettus <xof@thebuild.com> writes:

The problem indeed appear to be a very large number of subtransactions, each one creating a temp table, inside a single transaction. It's made worse by one of those transactions finally getting replayed on the secondary, only to have another one come in right behind it...

Hmm, I tried to reproduce this and could not. I experimented with
various permutations of this:

This problem is probably related to commit 9b013dc238c, which AFAICS is
only in pg10, not 9.5.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
Re: startup process stuck in recovery

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Tom Lane wrote:

Hmm, I tried to reproduce this and could not. I experimented with
various permutations of this:

This problem is probably related to commit 9b013dc238c, which AFAICS is
only in pg10, not 9.5.

You're right, I was testing on HEAD, so that patch might've obscured
the problem. But the code looks like it could still be O(N^2) in
some cases. Will look again later.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#21Christophe Pettus
xof@thebuild.com
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christophe Pettus (#21)
#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#22)
#24Christophe Pettus
xof@thebuild.com
In reply to: Simon Riggs (#23)
#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Christophe Pettus (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#25)