Backends stuck in LISTEN

Started by Marko Tiikkaja over 12 years ago · 12 messages · bugs
#1 Marko Tiikkaja
marko@joh.to

Hi people,

Today we had a problem where around 30-40 backends were stuck in LISTEN
for a long time. I snatched a core dump before we had to kill the
server, and the backtrace looks like this:

#0 0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f26d79e5c70 in ?? ()
#2 0x00007f26d79e5c80 in ?? ()
#3 0x00007f254d772b60 in ?? ()
#4 0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#5 0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#6 0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>,
pageno=34331) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#7 0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331,
write_ok=1 '\001', xid=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#8 0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531,
lsn=0x7fffe8503750) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#9 0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#10 TransactionLogFetch (transactionId=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#11 0x00000000004885b1 in TransactionIdDidCommit
(transactionId=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#12 0x00000000005159f4 in asyncQueueProcessPageEntries
(page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#13 asyncQueueReadAllNotifications () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#14 0x0000000000516926 in Exec_ListenPreCommit () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#15 PreCommit_Notify () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#16 0x000000000048c863 in CommitTransaction () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#17 0x000000000048ce8d in CommitTransactionCommand () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#18 0x000000000063fafb in finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#19 finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#20 0x0000000000641e92 in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#21 0x0000000000603f83 in BackendRun (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#22 BackendStartup (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#23 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#24 0x00000000006049f1 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#25 0x000000000045a600 in main (argc=5, argv=0xc5d180) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199

This is on 64-bit Linux, version 9.1.9 from the apt repositories.

Any idea what could cause this? It took around 30 seconds for the
backends to respond to pg_terminate_backend().

Regards,
Marko Tiikkaja

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Marko Tiikkaja (#1)
Re: Backends stuck in LISTEN

Marko Tiikkaja wrote:

Hi people,

Today we had a problem where around 30-40 backends were stuck in
LISTEN for a long time. I snatched a core dump before we had to
kill the server, and the backtrace looks like this:

#0 0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f26d79e5c70 in ?? ()
#2 0x00007f26d79e5c80 in ?? ()
#3 0x00007f254d772b60 in ?? ()
#4 0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464

Hm, looks like either they got stuck trying to activate profiling (but
then Debian doesn't build its packages with gcc -pg, does it?), or the
backtrace is flawed for those upper frames. Maybe installing the libc
-dbg (I imagine it'd be libc6-dbg) package would be useful, if this
reoccurs.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Marko Tiikkaja
marko@joh.to
In reply to: Alvaro Herrera (#2)
Re: Backends stuck in LISTEN

On 1/8/14 3:00 PM, Alvaro Herrera wrote:

Marko Tiikkaja wrote:

#0 0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f26d79e5c70 in ?? ()
#2 0x00007f26d79e5c80 in ?? ()
#3 0x00007f254d772b60 in ?? ()
#4 0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464

Hm, looks like either they got stuck trying to activate profiling (but
then Debian doesn't build its packages with gcc -pg, does it?), or the
backtrace is flawed for those upper frames. Maybe installing the libc
-dbg (I imagine it'd be libc6-dbg) package would be useful, if this
reoccurs.

libc6-dbg only changed it a bit:

#0 0x00007f26dd17d5b7 in __monstartup (lowpc=<optimized out>,
highpc=140737090958784) at gmon.c:130
#1 0x00007f26d79fcae0 in ?? ()
#2 0x00007f26d79fcaf0 in ?? ()
#3 0x00007f254d772aa0 in ?? ()

I still have the core dump available; is there something else I can do?
From what I heard, this is the second time this has happened already :-(

Regards,
Marko Tiikkaja


#4 Marko Tiikkaja
marko@joh.to
In reply to: Marko Tiikkaja (#3)
Re: Backends stuck in LISTEN

On 1/8/14 3:24 PM, I wrote:

libc6-dbg only changed it a bit:

Oops, looks like there was something wonky about that dump. This should
make more sense:

#0 0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80,
interruptOK=0 '\000') at pg_sema.c:418
#2 0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#3 0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#4 0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>,
pageno=34331) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#5 0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331,
write_ok=1 '\001', xid=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#6 0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531,
lsn=0x7fffe8503750) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#7 0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#8 TransactionLogFetch (transactionId=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#9 0x00000000004885b1 in TransactionIdDidCommit
(transactionId=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#10 0x00000000005159f4 in asyncQueueProcessPageEntries
(page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#11 asyncQueueReadAllNotifications () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#12 0x0000000000516926 in Exec_ListenPreCommit () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#13 PreCommit_Notify () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#14 0x000000000048c863 in CommitTransaction () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#15 0x000000000048ce8d in CommitTransactionCommand () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#16 0x000000000063fafb in finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#17 finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#18 0x0000000000641e92 in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized out>)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#19 0x0000000000603f83 in BackendRun (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#20 BackendStartup (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#21 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#22 0x00000000006049f1 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#23 0x000000000045a600 in main (argc=5, argv=0xc5d180) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199

Regards,
Marko Tiikkaja


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Marko Tiikkaja (#4)
Re: Backends stuck in LISTEN

Marko Tiikkaja wrote:

On 1/8/14 3:24 PM, I wrote:

libc6-dbg only changed it a bit:

Oops, looks like there was something wonky about that dump. This
should make more sense:

#0 0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80,
interruptOK=0 '\000') at pg_sema.c:418
#2 0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#3 0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
out>) at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306

Yeah, this makes a lot more sense. What this is saying is that it's
waiting for the I/O lock on the pg_clog page to be released. In other
words, some other process is holding that lock; it is supposed to be
reading the page, but for some reason it's failing to do so in a timely
fashion. I guess you would need to get backtraces from all processes to
see which one differs (i.e. which one has the lock and is trying to do
something but not doing it).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Marko Tiikkaja
marko@joh.to
In reply to: Alvaro Herrera (#5)
Re: Backends stuck in LISTEN

On 1/8/14 3:50 PM, Alvaro Herrera wrote:

Yeah, this makes a lot more sense. What this is saying is that it's
waiting for the I/O lock on the pg_clog page to be released. In other
words, some other process is holding that lock; it is supposed to be
reading the page, but for some reason it's failing to do so in a timely
fashion. I guess you would need to get backtraces from all processes to
see which one differs (i.e. which one has the lock and is trying to do
something but not doing it).

I got only two, and they both looked like this, unfortunately. I'll try
to find the offender should this happen again.
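Should it recur, one way to grab backtraces from every backend at once is a loop over the postgres PIDs, along these lines (a sketch, assuming gdb and pgrep are installed; `dump_backtraces` is a hypothetical helper name, not an existing tool):

```shell
# Hypothetical sketch: dump a backtrace from every process matching a name.
# Assumes gdb and pgrep are on the PATH and we can ptrace the backends.
dump_backtraces() {
    for pid in $(pgrep -x "$1"); do
        echo "=== backtrace for PID $pid ==="
        gdb --batch -ex 'bt' -p "$pid" 2>/dev/null
    done
}
# Usage: dump_backtraces postgres > /tmp/all_backtraces.txt
```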

In the meanwhile I'll be happy to provide more information if someone
has any ideas.

Regards,
Marko Tiikkaja


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#6)
Re: Backends stuck in LISTEN

Marko Tiikkaja <marko@joh.to> writes:

In the meanwhile I'll be happy to provide more information if someone
has any ideas.

I don't suppose there's anything in the postmaster log suggesting trouble
with accessing pg_clog/ files shortly before the lockup?

I concur with Alvaro's assessment that this looks like a bunch of
processes all waiting for somebody else to finish reading the clog page
they want. If somebody had forgotten to unlock the I/O lock after loading
a clog buffer, we could get this symptom later on; but I find it hard to
see how that could happen.

BTW ... just looking at the code a bit ... I wonder whether there is any
interlock that ensures that listen-queue messages will be cleared before
the originating transaction's entry is truncated away from pg_clog?
It doesn't seem like an ancient XID in the message queue could cause this
particular symptom, but I could see it leading to "could not access
status of transaction" failures.

regards, tom lane


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Backends stuck in LISTEN

Marko Tiikkaja <marko@joh.to> writes:

In the meanwhile I'll be happy to provide more information if someone
has any ideas.

Oh, after further thought: this trace says the process was trying to look
up the status of XID 1124964531 (see the TransactionIdGetStatus and
TransactionLogFetch calls). Do all the backtraces show this same XID?

If I did the math right, this corresponds to pg_clog file 010C. Do you
have such a file? Does it look to have the right size and permissions?
What file names are present in pg_clog?

It'd also be interesting to see the current XID counter according to
pg_controldata, especially if you have some idea how far it's advanced
since the problem happened.
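As a sketch, that field can be pulled out of pg_controldata output like so (pg_controldata is the real utility shipped with PostgreSQL; the wrapper name and data directory path here are made up for illustration):

```shell
# Hypothetical wrapper around the real pg_controldata utility; the data
# directory argument is a placeholder for the actual cluster path.
show_next_xid() {
    pg_controldata "$1" | grep "NextXID"
}
# Usage: show_next_xid /var/lib/postgresql/9.1/main
```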

regards, tom lane


#9 Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#7)
Re: Backends stuck in LISTEN

On 1/8/14 4:38 PM, Tom Lane wrote:

I don't suppose there's anything in the postmaster log suggesting trouble
with accessing pg_clog/ files shortly before the lockup?

Nope, nothing out of the ordinary in the logs that I can find.

Regards,
Marko Tiikkaja


#10 Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#8)
Re: Backends stuck in LISTEN

On 1/8/14 4:55 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

In the meanwhile I'll be happy to provide more information if someone
has any ideas.

Oh, after further thought: this trace says the process was trying to look
up the status of XID 1124964531 (see the TransactionIdGetStatus and
TransactionLogFetch calls). Do all the backtraces show this same XID?

I only got two, and the other one wants the status of 1133447326.

If I did the math right, this corresponds to pg_clog file 010C. Do you
have such a file? Does it look to have the right size and permissions?
What file names are present in pg_clog?

No, it's not there. The files look like this:

-rw------- 1 postgres postgres 256K Nov 3 15:55 0390
-rw------- 1 postgres postgres 256K Nov 3 22:26 0391
..
-rw------- 1 postgres postgres 256K Jan 8 16:43 044D
-rw------- 1 postgres postgres 24K Jan 8 17:28 044E

It'd also be interesting to see the current XID counter according to
pg_controldata, especially if you have some idea how far it's advanced
since the problem happened.

Latest checkpoint's NextXID: 0/1155614649

(I hope that's the number you were looking for.)

Judging from a few days old base backup that number would have been
(very roughly) around 1148225540 when this incident happened.

Regards,
Marko Tiikkaja


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#10)
Re: Backends stuck in LISTEN

Marko Tiikkaja <marko@joh.to> writes:

On 1/8/14 4:55 PM, Tom Lane wrote:

If I did the math right, this corresponds to pg_clog file 010C. Do you
have such a file? Does it look to have the right size and permissions?
What file names are present in pg_clog?

No, it's not there. The files look like this:

-rw------- 1 postgres postgres 256K Nov 3 15:55 0390
-rw------- 1 postgres postgres 256K Nov 3 22:26 0391
..
-rw------- 1 postgres postgres 256K Jan 8 16:43 044D
-rw------- 1 postgres postgres 24K Jan 8 17:28 044E

Oh, I fat-fingered the math somehow, because when I do it over again,
I get 0430 (or 0438 for your other observed XID). That's right in the
range of your available clog files, so the notion of a crazy XID in
the message queue doesn't seem to be the explanation here.
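The corrected arithmetic can be sketched as follows (assuming the 9.1 defaults: two clog status bits per transaction, BLCKSZ = 8192, and 32 pages per SLRU segment file):

```python
# Map an XID to the pg_clog segment file holding its commit status.
# Constants mirror the 9.1 defaults in clog.c/slru.c: 2 status bits per
# transaction gives BLCKSZ * 4 transactions per page, 32 pages per segment.
BLCKSZ = 8192
CLOG_XACTS_PER_PAGE = BLCKSZ * 4        # 32768 transactions per clog page
SLRU_PAGES_PER_SEGMENT = 32

def clog_segment_name(xid):
    """Return the pg_clog file name (four hex digits) covering this XID."""
    pageno = xid // CLOG_XACTS_PER_PAGE
    return "%04X" % (pageno // SLRU_PAGES_PER_SEGMENT)

print(clog_segment_name(1124964531))    # 0430 (pageno 34331, as in the trace)
print(clog_segment_name(1133447326))    # 0438
```

Both results land inside the 0390..044E range Marko listed, consistent with the conclusion that the XIDs are not crazy values truncated away from pg_clog.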

Since the referenced XIDs are not close together, I assume that the two
traces also show different pagenos (SimpleLruReadPage) and slotnos
(SimpleLruWaitIO) and lockids (LWLockAcquire)?

That leads to the idea that we've got some condition that broke
all of the clog SLRU slots at the same time, though it's really hard
to see what that might be ...

regards, tom lane


#12 Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#11)
Re: Backends stuck in LISTEN

On 1/8/14 6:04 PM, Tom Lane wrote:

Since the referenced XIDs are not close together, I assume that the two
traces also show different pagenos (SimpleLruReadPage) and slotnos
(SimpleLruWaitIO) and lockids (LWLockAcquire)?

Correct. The other one has a lockno of NumFixedLWLocks, if that's of
any significance.

Regards,
Marko Tiikkaja

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs