Frequent "pg_ctl status" removing(?) semaphores (unlikely)

Started by rafover 9 years ago3 messagesgeneral

raf@raf.org

over 9 years ago

Hi,

debian-8 (stable), postgres-9.4.9

I've just started running "/etc/init.d/postgresql-9.4 status"
every minute via cron and it seems to be having a very bad effect
on the server ["So stop doing it!" heard from the peanut gallery].

I noticed an error message like:

FATAL: the database system is in recovery mode

when it shouldn't have been in recovery mode,

and the log files say:

[25844]: FATAL: semctl(5505030, 6, SETVAL, 0) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6714]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server 2016-09-27 10:27:16 AEST::@:[6714]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
[6714]: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server 2016-09-27 10:27:16 AEST::@:[6714]: HINT: In a moment you should be able to reconnect to the database and repeat your command.
current transaction and exit, because another server 2016-09-27 10:27:16 AEST::@:[6714]DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server 2016-09-27 10:27:16 AEST::@:[6714]: HINT: In a moment you should be able to reconnect to the database and repeat your command.:
HINT: In a moment you should be able to reconnect to the database and repeat your command.
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[6708]: LOG: semctl(5537799, 0, IPC_RMID, ...) failed: Invalid argument
[25845]: LOG: database system was interrupted; last known up at 2016-09-23 09:29:28 AEST

However, it's not really in recovery mode and it doesn't come
good again until I manually restart the server.

Googling for this issue revealed that something is deleting
postgres's semaphores. Sure enough, the "ipcs -s" shows no
semaphores when this happens rather than showing the usual eight
lines of semaphore information.

The advice was to look for another script issuing ipcrm commands
(or, if running postgres in multiple freebsd jails, use a
different userid for each).

The only thing I'm doing differently since this started is
running this every minute from a script run by cron:

/etc/init.d/postgresql-9.4 status

That means a call to:

su -s /bin/sh - postgres -c "LD_LIBRARY_PATH=/opt/PostgreSQL/9.4/lib:$LD_LIBRARY_PATH
/opt/PostgreSQL/9.4/bin/pgpg_ctl status -D \"/data/payroll-9.4\""

Now, I can't prove that "pg_ctl status" is causing the problem
but when I disable the cronjob, the problem disappears and
whenever I enable it the problem reappears fairly quickly (i.e.
within an hour or two) so I'm fairly convinced that it's
involved. And I've just noticed that the logfile messages
above do refer to IPC_RMID so some part of postgres
is trying to remove the semaphores but it looks like they're
already gone when it tries.

So, my qestion is, is it possible that "pg_ctl status" could be
removing postgres's semaphores and can I stop it? It seems
extremely unlikely. So, if it isn't, what else could it be?
Systemd perhaps? It's been known to kill screen/tmux/nohup
processes when a user logs out in its keenness to clean up but
that may be clutching at straws.

At first, when I saw this, I assumed that I had stopped the
server interactively at the same time as the cronjob was
starting it and the two actions clashed with regards to
semaphore creation and removal but I wasn't convinced. And I'm
not trying to stop the server now. I'm just running the cronjob
to check the status. And the problem still occurs.

In case you're wondering what else the cronjob does, the first
thing it does is:

/etc/init.d/postgresql-9.4 status | grep -q 'server is running' && exit 0

So it's not doing anything else if postgres is running.

Any idea what I've done wrong? (apart from the obvious) :-)

cheers,
raf

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: raf (#1)

Re: Frequent "pg_ctl status" removing(?) semaphores (unlikely)

On Tue, Sep 27, 2016 at 2:13 PM, raf <raf@raf.org> wrote:

So, my qestion is, is it possible that "pg_ctl status" could be
removing postgres's semaphores and can I stop it? It seems
extremely unlikely. So, if it isn't, what else could it be?
Systemd perhaps? It's been known to kill screen/tmux/nohup
processes when a user logs out in its keenness to clean up but
that may be clutching at straws.

systemd sometimes has fun removing semaphores. See here for example:
/messages/by-id/CAK7tEys9-O4BTERbs3Xuk2BfFNNd55u2sM9j5R2Fi7v6BHjrQw@mail.gmail.com
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

raf

raf@raf.org

over 9 years ago

In reply to: Michael Paquier (#2)

Re: Frequent "pg_ctl status" removing(?) semaphores (unlikely)

Michael Paquier wrote:

On Tue, Sep 27, 2016 at 2:13 PM, raf <raf@raf.org> wrote:

So, my qestion is, is it possible that "pg_ctl status" could be
removing postgres's semaphores and can I stop it? It seems
extremely unlikely. So, if it isn't, what else could it be?
Systemd perhaps? It's been known to kill screen/tmux/nohup
processes when a user logs out in its keenness to clean up but
that may be clutching at straws.

systemd sometimes has fun removing semaphores. See here for example:
/messages/by-id/CAK7tEys9-O4BTERbs3Xuk2BfFNNd55u2sM9j5R2Fi7v6BHjrQw@mail.gmail.com
--
Michael

Thanks, Michael. That's it. I added RemoveIPC=no to
/etc/systemd/logind.conf, restarted systemd-logind,
and re-enabled the cron job, and it's been fine for
a whole day.

Now to do the same on all my other systemd-afflicted
hosts (just on principal). :-)

It looks to me like a bug in systemd. If it's removing
IPC resources because it thinks that the postgres user has
completely "logged out", why does it take over 100 "logins"
(i.e. cronjobs) before it decides that the postgres user
is completely logged out. What was the difference between all
the previous cronjobs and the one where it finally decided
to remove the semaphores? Why didn't it remove them after the
first cronjob had completed? And why did it think the postgres
user was completely "logged out" when it's also running the
postgres server processes? It might be because it sees the
postgres server processes as belonging to my user "slice"
because I started it even though it's running as the postgres
user. That doesn't seem like a useful way of viewing things.

But this is all rhetorical. No doubt there are reasons for this
but I don't care as long as I can turn it off.

Thanks again,
raf

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general