PANIC during exit on behalf of FATAL semop error

Started by Dave Vitekalmost 9 years ago3 messagesbugs

dvitek@grammatech.com

almost 9 years ago

We have an x86_64 linux machine running postgresql 9.6.2. Our
application uses LISTEN/NOTIFY. We recently made a change so that our
testing infrastructure would notice postgres crashes and out popped this
crash:

elog(PANIC, "queueing for lock while waiting on another one");

in this code:

/*
* Add ourselves to the end of the queue.
*
* NB: Mode can be LW_WAIT_UNTIL_FREE here!
*/
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
        /*
         * If we don't have a PGPROC structure, there's no way to wait.
This
         * should never occur, since MyProc should only be null during
shared
         * memory initialization.
         */
        if (MyProc == NULL)
                elog(PANIC, "cannot wait without a PGPROC structure");

if (MyProc->lwWaiting)
-----> elog(PANIC, "queueing for lock while waiting on another
one"); <-------------------

LWLockWaitListLock(lock);
...

Here's a stack trace and a more verbose stack trace.

#0 0x00007ff27bb50c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ff27bb54028 in __GI_abort () at abort.c:89
#2 0x0000000000978119 in errfinish (dummy=0) at elog.c:557
#3 0x000000000097a671 in elog_finish (elevel=22, fmt=0xc5a000 "queueing for lock while waiting on another one") at elog.c:1378
#4 0x000000000080eb73 in LWLockQueueSelf (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1035
#5 0x000000000080ee30 in LWLockAcquire (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1250
#6 0x00000000007fe416 in CleanupInvalidationState (status=1, arg=140679430974720) at sinvaladt.c:344
#7 0x00000000007f5132 in shmem_exit (code=1) at ipc.c:261
#8 0x00000000007f4f63 in proc_exit_prepare (code=1) at ipc.c:185
#9 0x00000000007f4eb3 in proc_exit (code=1) at ipc.c:102
#10 0x00000000009780e7 in errfinish (dummy=0) at elog.c:543
#11 0x000000000097a671 in elog_finish (elevel=21, fmt=0xc45b4a "semop(id=%d) failed: %m") at elog.c:1378
#12 0x00000000007881d9 in PGSemaphoreLock (sema=0x7ff27b7d6740) at pg_sema.c:391
#13 0x000000000080ee74 in LWLockAcquire (lock=0x7ff272b06f00, mode=LW_SHARED) at lwlock.c:1287
#14 0x0000000000604177 in asyncQueueReadAllNotifications () at async.c:1877
#15 0x000000000060464c in ProcessIncomingNotify () at async.c:2058
#16 0x0000000000603f04 in ProcessNotifyInterrupt () at async.c:1732
#17 0x000000000081da4d in ProcessClientReadInterrupt (blocked=1 '\001') at postgres.c:537
#18 0x00000000006d7a8d in secure_read (port=0x2bb90b0, ptr=0xfe8da0 <PqRecvBuffer>, len=8192) at be-secure.c:177
#19 0x00000000006e3f82 in pq_recvbuf () at pqcomm.c:921
#20 0x00000000006e4022 in pq_getbyte () at pqcomm.c:964
#21 0x000000000081d495 in SocketBackend (inBuf=0x7ffe148694a0) at postgres.c:334
#22 0x000000000081d9db in ReadCommand (inBuf=0x7ffe148694a0) at postgres.c:507
#23 0x00000000008228ff in PostgresMain (argc=1, argv=0x2bbb308, dbname=0x2bbb2f0 "cshub", username=0x2bbb2d0 "cshubuser") at postgres.c:4021
#24 0x000000000079f903 in BackendRun (port=0x2bb90b0) at postmaster.c:4272
#25 0x000000000079ef9d in BackendStartup (port=0x2bb90b0) at postmaster.c:3946
#26 0x000000000079b645 in ServerLoop () at postmaster.c:1701
#27 0x000000000079ab97 in PostmasterMain (argc=3, argv=0x2b918b0) at postmaster.c:1309
#28 0x00000000006e8f8c in main (argc=3, argv=0x2b918b0) at main.c:228

So the PANIC is really happening because of the first problem:

/*
* PGSemaphoreLock
*
* Lock a semaphore (decrement count), blocking if count would be < 0
*/
void
PGSemaphoreLock(PGSemaphore sema)
{
int errStatus;
struct sembuf sops;

        sops.sem_op = -1;                       /* decrement */
        sops.sem_flg = 0;
        sops.sem_num = sema->semNum;

        /*
         * Note: if errStatus is -1 and errno == EINTR then it means we
returned
         * from the operation prematurely because we were sent a
signal. So we
         * try and lock the semaphore again.
         *
         * We used to check interrupts here, but that required servicing
         * interrupts directly from signal handlers. Which is hard to
do safely
         * and portably.
         */
        do
        {
                errStatus = semop(sema->semId, &sops, 1);
        } while (errStatus < 0 && errno == EINTR);

if (errStatus < 0)
------> elog(FATAL, "semop(id=%d) failed: %m", sema->semId);
<--------------
}

Should this one use PANIC instead of FATAL given that the FATAL exit
path causes a PANIC in some cases? Is there an opportunity to repair
the state of things enough that a FATAL exit is possible here?

I'm not seeing either of the messages in the postgres logs, so I'm not
sure what %m expanded to. The logs are short and uninteresting--just
startup and shutdown messages. It is possible the machine was being
stressed when this occurred.

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Dave Vitek (#1)

Re: PANIC during exit on behalf of FATAL semop error

Hi,

On 2017-09-15 17:30:51 -0400, Dave Vitek wrote:

We have an x86_64 linux machine running postgresql 9.6.2.ï¿½ Our application
uses LISTEN/NOTIFY.ï¿½ We recently made a change so that our testing
infrastructure would notice postgres crashes and out popped this crash:

Oh. That's curious.

elog(PANIC, "queueing for lock while waiting on another one");

in this code:

/*
ï¿½* Add ourselves to the end of the queue.
ï¿½*
ï¿½* NB: Mode can be LW_WAIT_UNTIL_FREE here!
ï¿½*/
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ /*
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ * If we don't have a PGPROC structure, there's no way to wait. This
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ * should never occur, since MyProc should only be null during
shared
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ * memory initialization.
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ */
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ if (MyProc == NULL)
ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ elog(PANIC, "cannot wait without a PGPROC structure");

ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ if (MyProc->lwWaiting)
-----> ï¿½ï¿½ï¿½ï¿½ï¿½ elog(PANIC, "queueing for lock while waiting on another one");
<-------------------

Here's a stack trace and a more verbose stack trace.

#0 0x00007ff27bb50c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ff27bb54028 in __GI_abort () at abort.c:89
#2 0x0000000000978119 in errfinish (dummy=0) at elog.c:557
#3 0x000000000097a671 in elog_finish (elevel=22, fmt=0xc5a000 "queueing for lock while waiting on another one") at elog.c:1378
#4 0x000000000080eb73 in LWLockQueueSelf (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1035
#5 0x000000000080ee30 in LWLockAcquire (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1250
#6 0x00000000007fe416 in CleanupInvalidationState (status=1, arg=140679430974720) at sinvaladt.c:344
#7 0x00000000007f5132 in shmem_exit (code=1) at ipc.c:261
#8 0x00000000007f4f63 in proc_exit_prepare (code=1) at ipc.c:185
#9 0x00000000007f4eb3 in proc_exit (code=1) at ipc.c:102
#10 0x00000000009780e7 in errfinish (dummy=0) at elog.c:543
#11 0x000000000097a671 in elog_finish (elevel=21, fmt=0xc45b4a "semop(id=%d) failed: %m") at elog.c:1378
#12 0x00000000007881d9 in PGSemaphoreLock (sema=0x7ff27b7d6740) at pg_sema.c:391
#13 0x000000000080ee74 in LWLockAcquire (lock=0x7ff272b06f00, mode=LW_SHARED) at lwlock.c:1287
#14 0x0000000000604177 in asyncQueueReadAllNotifications () at async.c:1877
#15 0x000000000060464c in ProcessIncomingNotify () at async.c:2058
#16 0x0000000000603f04 in ProcessNotifyInterrupt () at async.c:1732
#17 0x000000000081da4d in ProcessClientReadInterrupt (blocked=1 '\001') at postgres.c:537
#18 0x00000000006d7a8d in secure_read (port=0x2bb90b0, ptr=0xfe8da0 <PqRecvBuffer>, len=8192) at be-secure.c:177
#19 0x00000000006e3f82 in pq_recvbuf () at pqcomm.c:921
#20 0x00000000006e4022 in pq_getbyte () at pqcomm.c:964
#21 0x000000000081d495 in SocketBackend (inBuf=0x7ffe148694a0) at postgres.c:334
#22 0x000000000081d9db in ReadCommand (inBuf=0x7ffe148694a0) at postgres.c:507
#23 0x00000000008228ff in PostgresMain (argc=1, argv=0x2bbb308, dbname=0x2bbb2f0 "cshub", username=0x2bbb2d0 "cshubuser") at postgres.c:4021
#24 0x000000000079f903 in BackendRun (port=0x2bb90b0) at postmaster.c:4272
#25 0x000000000079ef9d in BackendStartup (port=0x2bb90b0) at postmaster.c:3946
#26 0x000000000079b645 in ServerLoop () at postmaster.c:1701
#27 0x000000000079ab97 in PostmasterMain (argc=3, argv=0x2b918b0) at postmaster.c:1309
#28 0x00000000006e8f8c in main (argc=3, argv=0x2b918b0) at main.c:228

So the PANIC is really happening because of the first problem:

/*
ï¿½* PGSemaphoreLock
ï¿½*
ï¿½* Lock a semaphore (decrement count), blocking if count would be < 0
ï¿½*/
void
PGSemaphoreLock(PGSemaphore sema)
{

ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ if (errStatus < 0)
------> ï¿½ï¿½ï¿½ elog(FATAL, "semop(id=%d) failed: %m", sema->semId);ï¿½
<--------------
}

Should this one use PANIC instead of FATAL given that the FATAL exit path
causes a PANIC in some cases?ï¿½ Is there an opportunity to repair the state
of things enough that a FATAL exit is possible here?

I'm right now more curious to discover how this happened. Are you by any
chance running this with systemd/logind in the mix? It's RemoveIPC=
setting can cause such things...

Greetings,

Andres Freund

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Dave Vitek

dvitek@grammatech.com

almost 9 years ago

In reply to: Andres Freund (#2)

Re: PANIC during exit on behalf of FATAL semop error

On 9/15/2017 5:42 PM, Andres Freund wrote:

Hi,

On 2017-09-15 17:30:51 -0400, Dave Vitek wrote:

We have an x86_64 linux machine running postgresql 9.6.2. Our application
uses LISTEN/NOTIFY. We recently made a change so that our testing
infrastructure would notice postgres crashes and out popped this crash:

Oh. That's curious.

elog(PANIC, "queueing for lock while waiting on another one");

in this code:

/*
* Add ourselves to the end of the queue.
*
* NB: Mode can be LW_WAIT_UNTIL_FREE here!
*/
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
        /*
         * If we don't have a PGPROC structure, there's no way to wait. This
         * should never occur, since MyProc should only be null during
shared
         * memory initialization.
         */
        if (MyProc == NULL)
                elog(PANIC, "cannot wait without a PGPROC structure");

        if (MyProc->lwWaiting)
----->       elog(PANIC, "queueing for lock while waiting on another one");
<-------------------
Here's a stack trace and a more verbose stack trace.

#0 0x00007ff27bb50c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ff27bb54028 in __GI_abort () at abort.c:89
#2 0x0000000000978119 in errfinish (dummy=0) at elog.c:557
#3 0x000000000097a671 in elog_finish (elevel=22, fmt=0xc5a000 "queueing for lock while waiting on another one") at elog.c:1378
#4 0x000000000080eb73 in LWLockQueueSelf (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1035
#5 0x000000000080ee30 in LWLockAcquire (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1250
#6 0x00000000007fe416 in CleanupInvalidationState (status=1, arg=140679430974720) at sinvaladt.c:344
#7 0x00000000007f5132 in shmem_exit (code=1) at ipc.c:261
#8 0x00000000007f4f63 in proc_exit_prepare (code=1) at ipc.c:185
#9 0x00000000007f4eb3 in proc_exit (code=1) at ipc.c:102
#10 0x00000000009780e7 in errfinish (dummy=0) at elog.c:543
#11 0x000000000097a671 in elog_finish (elevel=21, fmt=0xc45b4a "semop(id=%d) failed: %m") at elog.c:1378
#12 0x00000000007881d9 in PGSemaphoreLock (sema=0x7ff27b7d6740) at pg_sema.c:391
#13 0x000000000080ee74 in LWLockAcquire (lock=0x7ff272b06f00, mode=LW_SHARED) at lwlock.c:1287
#14 0x0000000000604177 in asyncQueueReadAllNotifications () at async.c:1877
#15 0x000000000060464c in ProcessIncomingNotify () at async.c:2058
#16 0x0000000000603f04 in ProcessNotifyInterrupt () at async.c:1732
#17 0x000000000081da4d in ProcessClientReadInterrupt (blocked=1 '\001') at postgres.c:537
#18 0x00000000006d7a8d in secure_read (port=0x2bb90b0, ptr=0xfe8da0 <PqRecvBuffer>, len=8192) at be-secure.c:177
#19 0x00000000006e3f82 in pq_recvbuf () at pqcomm.c:921
#20 0x00000000006e4022 in pq_getbyte () at pqcomm.c:964
#21 0x000000000081d495 in SocketBackend (inBuf=0x7ffe148694a0) at postgres.c:334
#22 0x000000000081d9db in ReadCommand (inBuf=0x7ffe148694a0) at postgres.c:507
#23 0x00000000008228ff in PostgresMain (argc=1, argv=0x2bbb308, dbname=0x2bbb2f0 "cshub", username=0x2bbb2d0 "cshubuser") at postgres.c:4021
#24 0x000000000079f903 in BackendRun (port=0x2bb90b0) at postmaster.c:4272
#25 0x000000000079ef9d in BackendStartup (port=0x2bb90b0) at postmaster.c:3946
#26 0x000000000079b645 in ServerLoop () at postmaster.c:1701
#27 0x000000000079ab97 in PostmasterMain (argc=3, argv=0x2b918b0) at postmaster.c:1309
#28 0x00000000006e8f8c in main (argc=3, argv=0x2b918b0) at main.c:228

So the PANIC is really happening because of the first problem:

/*
* PGSemaphoreLock
*
* Lock a semaphore (decrement count), blocking if count would be < 0
*/
void
PGSemaphoreLock(PGSemaphore sema)
{
        if (errStatus < 0)
------>     elog(FATAL, "semop(id=%d) failed: %m", sema->semId);
<--------------
}

Should this one use PANIC instead of FATAL given that the FATAL exit path
causes a PANIC in some cases? Is there an opportunity to repair the state
of things enough that a FATAL exit is possible here?

I'm right now more curious to discover how this happened. Are you by any
chance running this with systemd/logind in the mix? It's RemoveIPC=
setting can cause such things...

Greetings,

Andres Freund

Thanks for the quick reply.

I can see a systemd-logind process running on the machine. It's using
Ubuntu "trusty." I did not set it up, but assume it might be largely
using default settings.

The logind.conf looks like:

[Login]
#NAutoVTs=6
#ReserveVT=6
#KillUserProcesses=no
#KillOnlyUsers=
#KillExcludeUsers=root
Controllers=blkio cpu cpuacct cpuset devices freezer hugetlb memory
perf_event net_cls net_prio
ResetControllers=
#InhibitDelayMaxSec=5
#HandlePowerKey=poweroff
#HandleSuspendKey=suspend
#HandleHibernateKey=hibernate
#HandleLidSwitch=suspend
#PowerKeyIgnoreInhibited=no
#SuspendKeyIgnoreInhibited=no
#HibernateKeyIgnoreInhibited=no
#LidSwitchIgnoreInhibited=yes
#IdleAction=ignore
#IdleActionSec=30min

so no mention of RemoveIPC. I will add the line to disable it and
reboot the system anyway. I wonder how it complains about unrecognized
variables?

It's also not completely impossible that a human trying to fix something
ran ipcrm in error.

- Dave

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs