BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

Started by Daniel Hahlerabout 12 years ago10 messagesbugs
Jump to latest
#1Daniel Hahler
postgresql@thequod.de

The following bug has been logged on the website:

Bug reference: 9721
Logged by: Daniel Hahler
Email address: postgresql@thequod.de
PostgreSQL version: 9.3.3
Operating system: Debian
Description:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection
reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110,
but it contains 0

I have found that the source code says that this should never happen:

/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");

Another attempt at starting it worked:

2014-03-25 12:38:43 CET LOG: database system was interrupted; last known up
at 2014-03-24 18:22:38 CET
2014-03-25 12:38:43 CET LOG: database system was not properly shut down;
automatic recovery in progress
2014-03-25 12:38:43 CET LOG: incomplete startup packet
2014-03-25 12:38:44 CET LOG: record with zero length at 0/7843020
2014-03-25 12:38:44 CET LOG: redo is not required
2014-03-25 12:38:44 CET LOG: autovacuum launcher started
2014-03-25 12:38:44 CET LOG: database system is ready to accept
connections

I am using 9.3.3-1.pgdg70+1 from http://apt.postgresql.org/pub/repos/apt/ on
a Debian Wheezy system, running in an OpenVZ container.

I have just now noticed that 9.3.4 has been released, and upgraded to it. I
am sorry if that's something that has been fixed in that update already.

Thanks!

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Hahler (#1)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

postgresql@thequod.de writes:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection
reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110,
but it contains 0

That's odd (and as you say, unexpected) but this log extract doesn't give
much clue as to how we got into this state. What was going on before
this? In particular, it's hard to call this "failure to start up" when
you evidently had a hundred or so postmaster child processes already.
Could there have been some unexpected surge in the number of connection
attempts just after the database came up? Also, this extract doesn't look
like anything that would've caused the postmaster to decide to shut down
again, so what happened after that? Or in short, I want to see the rest
of the log not just this part.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

Tom Lane wrote:

postgresql@thequod.de writes:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection
reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110,
but it contains 0

That's odd (and as you say, unexpected) but this log extract doesn't give
much clue as to how we got into this state. What was going on before
this? In particular, it's hard to call this "failure to start up" when
you evidently had a hundred or so postmaster child processes already.
Could there have been some unexpected surge in the number of connection
attempts just after the database came up? Also, this extract doesn't look
like anything that would've caused the postmaster to decide to shut down
again, so what happened after that? Or in short, I want to see the rest
of the log not just this part.

Here's my guess --- this is a virtualized system that somehow dumped
some state to disk to hibernate while the host was being rebooted; and
then, when the host was up again, it tried to resurrect the virtual
machine and found things to be all inconsistent.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Daniel Hahler
postgresql@thequod.de
In reply to: Alvaro Herrera (#3)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

On 25.03.2014 15:36, Alvaro Herrera wrote:

Tom Lane wrote:

postgresql@thequod.de writes:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection
reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109,
but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110,
but it contains 0

That's odd (and as you say, unexpected) but this log extract doesn't give
much clue as to how we got into this state. What was going on before
this? In particular, it's hard to call this "failure to start up" when
you evidently had a hundred or so postmaster child processes already.
Could there have been some unexpected surge in the number of connection
attempts just after the database came up? Also, this extract doesn't look
like anything that would've caused the postmaster to decide to shut down
again, so what happened after that? Or in short, I want to see the rest
of the log not just this part.

That was the whole log.

The rotated one before has only:
2014-03-22 03:51:37 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:52:25 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:59:31 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 04:00:18 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 06:03:06 CET LOG: could not receive data from client: Connection reset by peer

Should I increase the logging verbosity, in case this happens again?
If so, to what? (I have not configured logging yet, so it has the defaults from your Debian package).

Here's my guess --- this is a virtualized system that somehow dumped
some state to disk to hibernate while the host was being rebooted; and
then, when the host was up again, it tried to resurrect the virtual
machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

From the host:
Mar 25 11:54:48 HN kernel: [ 76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [ 91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots.
I will change this to "stop" for the PostgreSQL container, but still this seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after rebooting the host?

Please also note that the PostgreSQL error happened half an hour after the reboot/resuming of the container.

Thanks,
Daniel.

--
http://daniel.hahler.de/

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Hahler (#4)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

Daniel Hahler wrote:

On 25.03.2014 15:36, Alvaro Herrera wrote:

Here's my guess --- this is a virtualized system that somehow dumped
some state to disk to hibernate while the host was being rebooted; and
then, when the host was up again, it tried to resurrect the virtual
machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

From the host:
Mar 25 11:54:48 HN kernel: [ 76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [ 91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots.
I will change this to "stop" for the PostgreSQL container, but still this seems like something PostgreSQL should handle better.

Of course. We will accept your patch as soon as you discover the bug.
Thanks for the offer.

FWIW, I have just suspended and started the container manually, and
PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after rebooting the host?

Maybe it forgot to restore the shared memory state.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Daniel Hahler
postgresql@thequod.de
In reply to: Alvaro Herrera (#5)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

On 25.03.2014 16:26, Alvaro Herrera wrote:

FWIW, I have just suspended and started the container manually, and
PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after rebooting the host?

Maybe it forgot to restore the shared memory state.

Shouldn't that have caused PostgreSQL to error out much earlier - and not after half an hour?

Anyway, maybe it won't affect 9.3.4 anymore, and it's good to know that it might just be related to rebooting with the "suspend" option.

Regards and thanks,
Daniel.

--
http://daniel.hahler.de/

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Hahler (#4)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

Daniel Hahler <postgresql@thequod.de> writes:

On 25.03.2014 15:36, Alvaro Herrera wrote:

Tom Lane wrote:

postgresql@thequod.de writes:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

... in short, I want to see the rest of the log not just this part.

That was the whole log.

Oh. So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

I tend to concur with Alvaro's theory that OpenVZ must've failed to
restore the contents of shared memory precisely. If there were no other
symptoms than these, it seems like the misrestore must have affected just
a few small areas of the shared memory segment. Which seems odd, but then
again I know nothing of the internals of OpenVZ, so maybe it's a plausible
failure mode.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Daniel Hahler
postgresql@thequod.de
In reply to: Tom Lane (#7)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

On 25.03.2014 17:02, Tom Lane wrote:

Daniel Hahler <postgresql@thequod.de> writes:

On 25.03.2014 15:36, Alvaro Herrera wrote:

Tom Lane wrote:

postgresql@thequod.de writes:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

... in short, I want to see the rest of the log not just this part.

That was the whole log.

Oh. So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

Sorry, "failed to startup" was a misdescription, because I did not remember that OpenVZ suspends containers on reboots.
I should have said "crashed half an hour after being resumed".

I have not restarted PostgreSQL though, but started it when noticing that it wasn't running (anymore).

From the logs (again):
2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0

I have noticed it, and started it:

2014-03-25 12:38:43 CET LOG: database system was interrupted; last known up at 2014-03-24 18:22:38 CET

If you expect more information from the logs (in case this happens again), I would appreciate some hint about what would be important to enable in the configuration.

Regards,
Daniel.

--
http://daniel.hahler.de/

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Hahler (#8)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

Daniel Hahler <postgresql@thequod.de> writes:

On 25.03.2014 17:02, Tom Lane wrote:

Oh. So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

Sorry, "failed to startup" was a misdescription, because I did not remember that OpenVZ suspends containers on reboots.
I should have said "crashed half an hour after being resumed".

I have not restarted PostgreSQL though, but started it when noticing that it wasn't running (anymore).

Well, your log doesn't show any evidence to support the claim that
Postgres wasn't running. We see some child processes having minor trouble
while exiting, and one that failed while trying to start up; the latter
being strong evidence that the postmaster was still working at least.

So in short, you haven't given us a complete description of what happened.
What led you to decide you needed to restart Postgres?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10Daniel Hahler
postgresql@thequod.de
In reply to: Tom Lane (#9)
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

On 25.03.2014 17:26, Tom Lane wrote:

Daniel Hahler <postgresql@thequod.de> writes:

On 25.03.2014 17:02, Tom Lane wrote:

Oh. So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

Sorry, "failed to startup" was a misdescription, because I did not remember that OpenVZ suspends containers on reboots.
I should have said "crashed half an hour after being resumed".

I have not restarted PostgreSQL though, but started it when noticing that it wasn't running (anymore).

Well, your log doesn't show any evidence to support the claim that
Postgres wasn't running. We see some child processes having minor trouble
while exiting, and one that failed while trying to start up; the latter
being strong evidence that the postmaster was still working at least.

I've received the following error on 11:32:19 from a client app trying to connect:
OperationalError: could not connect to server: Connection refused
Is the server running on host "10.122.x.x" and accepting
TCP/IP connections on port x?

So in short, you haven't given us a complete description of what happened.
What led you to decide you needed to restart Postgres?

There was no Postgres process running in 'ps aux', and I only had to start, not restart it.

Relevant entries from the shell history (llog is for syslog/messages):

47 12:32 ps aux
48 12:33 llog
52 12:33 less ~l/daemon.log
55 12:34 less ~l/postgresql/postgresql-9.3-main.log
56 12:34 ps aux
64 12:38 service postgresql start

Regards,
Daniel.

--
http://daniel.hahler.de/