PMChildFlags array

Started by bhargav kamineni over 6 years ago, 10 messages, general

bhargavpostgres@gmail.com

over 6 years ago

Hi,

Observed below errors in logfile

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07 04:21:14
UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20 02:00:24
UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20 02:00:24
UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""

What could be the possible reasons for this to occur, and is there any
chance of database corruption after this event?

Regards,
Bhargav
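
For readers following along: the log lines above are in PostgreSQL's csvlog format. Below is a minimal sketch of pulling out the fields of the FATAL line. It assumes the 23-column layout documented for PostgreSQL 10 csvlog, and that the mail-wrapped line is rejoined with a single space. The helper name parse_csvlog_line is made up for illustration.

```python
import csv
import io

# Column names as listed in the PostgreSQL 10 csvlog documentation
# (an assumption here; newer releases append more columns).
CSVLOG_COLUMNS = [
    "log_time", "user_name", "database_name", "process_id",
    "connection_from", "session_id", "session_line_num", "command_tag",
    "session_start_time", "virtual_transaction_id", "transaction_id",
    "error_severity", "sql_state_code", "message", "detail", "hint",
    "internal_query", "internal_query_pos", "context", "query",
    "query_pos", "location", "application_name",
]


def parse_csvlog_line(line):
    """Parse one csvlog record into a dict keyed by column name."""
    row = next(csv.reader(io.StringIO(line)))
    return dict(zip(CSVLOG_COLUMNS, row))


# The FATAL line from the report, with the mail wrapping undone.
FATAL_LINE = (
    '2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,'
    '2019-09-07 04:21:14 UTC,,0,FATAL,XX000,'
    '"no free slots in PMChildFlags array",,,,,,,,,""'
)

record = parse_csvlog_line(FATAL_LINE)

# session_id is "<hex start time>.<hex pid>"; the second half should
# decode to the same number as the process_id column.
pid_from_session = int(record["session_id"].split(".")[1], 16)

if __name__ == "__main__":
    print(record["error_severity"], record["sql_state_code"],
          record["message"])
    print("pid from session_id:", pid_from_session)
```

The empty user_name and database_name fields, together with a session start time almost two weeks before the error, are consistent with the line coming from a long-lived server process rather than from a client session.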

bhargav kamineni

bhargavpostgres@gmail.com

over 6 years ago

In reply to: bhargav kamineni (#1)

Re: PMChildFlags array

Any suggestions on this ?

On Thu, 3 Oct 2019 at 16:27, bhargav kamineni <bhargavpostgres@gmail.com>
wrote:

Hi,

Observed below errors in logfile

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""

what could be the possible reasons for this to occur and is there any
chance of database corruption after this event ?

Regards,
Bhargav

Adrian Klaver

adrian.klaver@aklaver.com

over 6 years ago

In reply to: bhargav kamineni (#1)

Re: PMChildFlags array

On 10/3/19 3:57 AM, bhargav kamineni wrote:

Hi,

Observed below errors in logfile

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""

Postgres version?

OS and version?

What was the database doing just before the FATAL line?

what could be the possible reasons for this to occur and is there any
chance of database corruption after this event ?

The source (backend/storage/ipc/pmsignal.c) says:

"/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");
"

Someone else will need to comment on what 'messed up' could be.

Regards,
Bhargav

--
Adrian Klaver
adrian.klaver@aklaver.com

bhargav kamineni

bhargavpostgres@gmail.com

over 6 years ago

In reply to: Adrian Klaver (#3)

Re: PMChildFlags array

Hi,

Observed below errors in logfile

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""

Postgres version?

PostgreSQL 10.8

OS and version?

NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"

What was the database doing just before the FATAL line?

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
                           List of roles
 Role name    |          Attributes           |     Member of
--------------+-------------------------------+--------------------
 user_app     | No inheritance               +| {application_role}
              | 100 connections              +|
              | Password valid until infinity |

what could be the possible reasons for this to occur and is there any
chance of database corruption after this event ?

The source(backend/storage/ipc/pmsignal.c ) says:

"/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");
"

Someone else will need to comment on what 'messed up' could be

On Thu, 3 Oct 2019 at 18:56, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 10/3/19 3:57 AM, bhargav kamineni wrote:

Hi,

Observed below errors in logfile

2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""

Postgres version?

OS and version?

What was the database doing just before the FATAL line?

what could be the possible reasons for this to occur and is there any
chance of database corruption after this event ?

The source(backend/storage/ipc/pmsignal.c ) says:

"/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");
"

Someone else will need to comment on what 'messed up' could be.

Regards,
Bhargav

--
Adrian Klaver
adrian.klaver@aklaver.com

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: bhargav kamineni (#4)

Re: PMChildFlags array

bhargav kamineni <bhargavpostgres@gmail.com> writes:

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
List of roles
Role name | Attributes | Member of
--------------+-------------------------------+--------------------
user_app | No inheritance +| {application_role}
| 100 connections +|
| Password valid until infinity |

Hm, what's the overall max_connections limit? (I'm wondering
in particular if it's more or less than 100.)

regards, tom lane

bhargav kamineni

bhargavpostgres@gmail.com

over 6 years ago

In reply to: Tom Lane (#5)

Re: PMChildFlags array

bhargav kamineni <bhargavpostgres@gmail.com> writes:

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
List of roles
Role name | Attributes | Member of
--------------+-------------------------------+--------------------
user_app | No inheritance +| {application_role}
| 100 connections +|
| Password valid until infinity |

Hm, what's the overall max_connections limit? (I'm wondering
in particular if it's more or less than 100.)

It's set to 500:
show max_connections ;
max_connections
-----------------
500

On Thu, 3 Oct 2019 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

bhargav kamineni <bhargavpostgres@gmail.com> writes:

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
List of roles
Role name | Attributes | Member of
--------------+-------------------------------+--------------------
user_app | No inheritance +| {application_role}
| 100 connections +|
| Password valid until infinity |

Hm, what's the overall max_connections limit? (I'm wondering
in particular if it's more or less than 100.)

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: bhargav kamineni (#4)

Re: PMChildFlags array

bhargav kamineni <bhargavpostgres@gmail.com> writes:

What was the database doing just before the FATAL line?

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

So ... how many is "a bunch"?

Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit. While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.

However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus. This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions. That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.

Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support. But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.

regards, tom lane
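
The arithmetic behind "1000-plus" can be sketched as follows. This assumes the PostgreSQL 10 definition of MaxLivePostmasterChildren() as 2 * (MaxConnections + autovacuum_max_workers + 1 + max_worker_processes), which is worth checking against postmaster.c for the release in use. It also assumes the stock defaults autovacuum_max_workers = 3 and max_worker_processes = 8, since the thread does not state those settings. The Python function names are illustrative only.

```python
def max_live_postmaster_children(max_connections,
                                 autovacuum_max_workers=3,
                                 max_worker_processes=8):
    """Size of the child-slot array, per the assumed PostgreSQL 10 formula."""
    return 2 * (max_connections + autovacuum_max_workers + 1
                + max_worker_processes)


def attempts_needed_to_fill(slots, live_children):
    """Not-yet-rejected connection attempts needed to use every slot."""
    return max(slots - live_children, 0)


slots = max_live_postmaster_children(500)

# Easiest case for hitting the limit: every regular backend, autovacuum
# and worker position (500 + 3 + 1 + 8 = 512) is already occupied.
busy = 500 + 3 + 1 + 8
needed = attempts_needed_to_fill(slots, busy)

if __name__ == "__main__":
    print("child slots:", slots)
    print("pending attempts needed:", needed)
```

With fewer sessions legitimately open, the number of in-flight attempts needed grows accordingly, which matches the "maybe well more than that" remark above.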

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: bhargav kamineni (#6)

Re: PMChildFlags array

On 2019-Oct-03, bhargav kamineni wrote:

bhargav kamineni <bhargavpostgres@gmail.com> writes:

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

db=\du user_app
List of roles
Role name | Attributes | Member of
--------------+-------------------------------+--------------------
user_app | No inheritance +| {application_role}
| 100 connections +|
| Password valid until infinity |

Was the machine overloaded at the time the problem occurred?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

bhargav kamineni

bhargavpostgres@gmail.com

over 6 years ago

In reply to: Tom Lane (#7)

Re: PMChildFlags array

Thanks Tom Lane for detailing the issue.

So ... how many is "a bunch"?

more than 85

Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit. While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.

We enabled "max_parallel_workers_per_gather = 4" 20 days before we
ran into this issue.

However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus. This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions. That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.

The errors below were observed in the Postgres logfile after the crash:

ERROR: xlog flush request is not satisfied -- seen for a couple of tables;
we ran VACUUM FULL on those tables and the error went away after that.
ERROR: right sibling's left-link doesn't match: block 273660 links to
273500 instead of expected 273661 in index -- observed while running
VACUUM FREEZE on the database; we dropped this index and created a new
one.

Observations:

The vacuum freeze analyze job, which is started through a cron job, is
getting stuck on the database side. pg_cancel_backend() and
pg_terminate_backend() are not able to terminate the stuck processes;
only restarting the database clears them. I think this is happening due
to corruption (if so, how can I detect it? pg_dump?). Is there any way to
overcome this problem?

Would migrating the database to a new instance (pg_basebackup and
switching over to the new instance) solve this issue?

On Fri, 4 Oct 2019 at 03:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

bhargav kamineni <bhargavpostgres@gmail.com> writes:

What was the database doing just before the FATAL line?

Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""

So ... how many is "a bunch"?

Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit. While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.

However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus. This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions. That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.

Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support. But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: bhargav kamineni (#9)

Re: PMChildFlags array

bhargav kamineni <bhargavpostgres@gmail.com> writes:

So ... how many is "a bunch"?

more than 85

Hm. That doesn't seem like it'd be enough to trigger the problem;
you'd need about max_connections excess connections (that are shortly
going to be rejected) to run into this problem, and you said you
had max_connections = 500. Maybe several different clients were all
doing this at once?

But anyway, AFAICS there is only one code path that could lead to the
reported error message, so one way or another you got there. I've
pushed a fix for this, which will be in next month's releases.

The errors below were observed in the Postgres logfile after the crash:

ERROR: xlog flush request is not satisfied -- seen for a couple of tables;
we ran VACUUM FULL on those tables and the error went away after that.
ERROR: right sibling's left-link doesn't match: block 273660 links to
273500 instead of expected 273661 in index -- observed while running
VACUUM FREEZE on the database; we dropped this index and created a new
one.

That seems unrelated. A postmaster crash shouldn't have any
data-corruption consequences, since it never touches any
relation files directly.

regards, tom lane
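
The failure mode discussed in this thread can be illustrated with a toy model. This is not the postmaster's code and not the fix that was pushed. The class, its methods and the reserve parameter are hypothetical. They only mimic the idea that connection admission should stop short of the slot-array size so that background processes can still get a slot.

```python
class ChildSlots:
    """Toy model of a fixed-size child-slot array (hypothetical)."""

    def __init__(self, total, reserve=0):
        self.total = total      # size of the slot array
        self.reserve = reserve  # slots held back from client connections
        self.used = 0

    def can_accept_connection(self):
        # Admission check applied to client connections only.
        return self.used < self.total - self.reserve

    def assign_slot(self):
        # Mimics the "should never happen" branch quoted from pmsignal.c.
        if self.used >= self.total:
            raise RuntimeError("no free slots in PMChildFlags array")
        self.used += 1


def flood_then_start_worker(total, reserve):
    """Fill slots with connection attempts, then ask for one worker slot."""
    slots = ChildSlots(total, reserve)
    while slots.can_accept_connection():
        slots.assign_slot()      # one more pending connection attempt
    try:
        slots.assign_slot()      # background or parallel worker start
        return "worker started"
    except RuntimeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(flood_then_start_worker(total=1024, reserve=0))
    print(flood_then_start_worker(total=1024, reserve=8))
```

With reserve=0 the admission limit equals the array size, so a worker start after a connection flood finds no slot. Any positive reserve leaves room for it.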