PMChildFlags array
Hi,
Observed below errors in logfile
2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07 04:21:14
UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20 02:00:24
UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20 02:00:24
UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
What could be the possible reasons for this to occur, and is there any
chance of database corruption after this event?
Regards,
Bhargav
Any suggestions on this ?
On 10/3/19 3:57 AM, bhargav kamineni wrote:
Hi,
Observed below errors in logfile
2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
Postgres version?
OS and version?
What was the database doing just before the FATAL line?
What could be the possible reasons for this to occur, and is there any
chance of database corruption after this event?
The source (backend/storage/ipc/pmsignal.c) says:
"/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");
"
Someone else will need to comment on what 'messed up' could be.
Regards,
Bhargav
--
Adrian Klaver
adrian.klaver@aklaver.com
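For readers unfamiliar with the code behind that comment, here is a toy model of a fixed-size child-slot array. It is only a sketch of the idea the comment describes (scan for a free slot, fail hard if none is left), not the PostgreSQL implementation; the names `ChildSlots`, `assign`, `release` and `NoFreeSlots` are hypothetical.

```python
class NoFreeSlots(Exception):
    """Stands in for the FATAL "no free slots" error in this toy model."""


class ChildSlots:
    def __init__(self, num_slots):
        # False means the slot is free, True means a child process owns it.
        self.in_use = [False] * num_slots

    def assign(self):
        # Scan for the first free slot and claim it.
        for i, used in enumerate(self.in_use):
            if not used:
                self.in_use[i] = True
                return i
        # Nothing free: more children were launched than slots exist.
        raise NoFreeSlots("no free slots in PMChildFlags array")

    def release(self, slot):
        # A child exited, so its slot can be reused.
        self.in_use[slot] = False


slots = ChildSlots(3)
held = [slots.assign() for _ in range(3)]  # fills every slot
try:
    slots.assign()  # a fourth child has nowhere to go
    overflowed = False
except NoFreeSlots:
    overflowed = True

slots.release(held[0])
reused = slots.assign()  # the freed slot is handed out again
```

The point of the model is that the array itself never grows: whoever launches children is expected to stay under the slot count, which is why the source comment blames the caller when the scan fails.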
Hi,
Observed below errors in logfile
2019-09-20 02:00:24.504 UTC,,,99779,,5d73303a.185c3,73,,2019-09-07
04:21:14 UTC,,0,FATAL,XX000,"no free slots in PMChildFlags
array",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109949,,5d8432b8.1ad7d,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
2019-09-20 02:00:24.505 UTC,,,109950,,5d8432b8.1ad7e,1,,2019-09-20
02:00:24 UTC,,0,ERROR,58P01,"could not open shared memory segment
""/PostgreSQL.2520932"": No such file or directory",,,,,,,,,""
Postgres version?
PostgreSQL 10.8
OS and version?
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
What was the database doing just before the FATAL line?
Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error that I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""
db=\du user_app
                       List of roles
  Role name   |          Attributes           |     Member of
--------------+-------------------------------+--------------------
 user_app     | No inheritance               +| {application_role}
              | 100 connections              +|
              | Password valid until infinity |
What could be the possible reasons for this to occur, and is there any
chance of database corruption after this event?
The source (backend/storage/ipc/pmsignal.c) says:
"/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");
"
Someone else will need to comment on what 'messed up' could be
bhargav kamineni <bhargavpostgres@gmail.com> writes:
Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error that I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""
db=\du user_app
                       List of roles
  Role name   |          Attributes           |     Member of
--------------+-------------------------------+--------------------
 user_app     | No inheritance               +| {application_role}
              | 100 connections              +|
              | Password valid until infinity |
Hm, what's the overall max_connections limit? (I'm wondering
in particular if it's more or less than 100.)
regards, tom lane
bhargav kamineni <bhargavpostgres@gmail.com> writes:
Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error that I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""
db=\du user_app
                       List of roles
  Role name   |          Attributes           |     Member of
--------------+-------------------------------+--------------------
 user_app     | No inheritance               +| {application_role}
              | 100 connections              +|
              | Password valid until infinity |
Hm, what's the overall max_connections limit? (I'm wondering
in particular if it's more or less than 100.)
It's set to 500:
show max_connections ;
max_connections
-----------------
500
bhargav kamineni <bhargavpostgres@gmail.com> writes:
What was the database doing just before the FATAL line?
Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error that I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""
So ... how many is "a bunch"?
Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit. While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.
However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus. This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions. That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.
Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support. But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.
regards, tom lane
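The "1000-plus" figure can be reproduced with a little arithmetic. The sketch below assumes that MaxLivePostmasterChildren() in PostgreSQL 10 is computed as 2 * (max_connections + autovacuum_max_workers + 1 + max_worker_processes), and that the poster runs with the default autovacuum_max_workers = 3 and max_worker_processes = 8; neither setting was reported in the thread, so treat both as assumptions and check them against your own source tree and configuration. The helper names are hypothetical.

```python
def max_live_postmaster_children(max_connections,
                                 autovacuum_max_workers=3,
                                 max_worker_processes=8):
    # Assumed formula; the defaults are the stock settings, not values
    # reported by the poster.
    return 2 * (max_connections + autovacuum_max_workers + 1
                + max_worker_processes)


def pending_attempts_needed(child_limit, open_sessions):
    # Rough count of not-yet-rejected connection attempts needed to use up
    # the remaining child slots, ignoring other child processes.
    return max(child_limit - open_sessions, 0)


limit = max_live_postmaster_children(500)
needed_when_full = pending_attempts_needed(limit, 500)    # 500 real sessions
needed_when_quiet = pending_attempts_needed(limit, 100)   # only 100 sessions

print(limit, needed_when_full, needed_when_quiet)
```

Under these assumptions the limit comes out at 1024, so even with all 500 sessions legitimately open, more than 500 further attempts would have to be in flight at once, which is in line with the "at least 500" estimate above.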
On 2019-Oct-03, bhargav kamineni wrote:
bhargav kamineni <bhargavpostgres@gmail.com> writes:
Postgres was rejecting a bunch of connections from a user who has a
connection limit set. That was the FATAL error that I could see in the log
file.
FATAL,53300,"too many connections for role ""user_app"""
db=\du user_app
                       List of roles
  Role name   |          Attributes           |     Member of
--------------+-------------------------------+--------------------
 user_app     | No inheritance               +| {application_role}
              | 100 connections              +|
              | Password valid until infinity |
Was the machine overloaded at the time the problem occurred?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks Tom Lane for detailing the issue.
So ... how many is "a bunch"?
more than 85
Looking at the code, it seems like it'd be possible for a sufficiently
aggressive spawner of incoming connections to reach the
MaxLivePostmasterChildren limit. While the postmaster would correctly
reject additional connection attempts after that, what it would not do
is ensure that any child slots are left for new parallel worker processes.
So we could hypothesize that the error you're seeing in the log is from
failure to spawn a parallel worker process, due to being out of child
slots.
We enabled "max_parallel_workers_per_gather = 4" about 20 days before we
ran into this issue.
However, given that max_connections = 500, MaxLivePostmasterChildren()
would be 1000-plus. This would mean that reaching this condition would
require *at least* 500 concurrent connection-attempts-that-haven't-yet-
been-rejected, maybe well more than that if you didn't have close to
500 legitimately open sessions. That seems like a lot, enough to suggest
that you've got some pretty serious bug in your client-side logic.
The errors below were observed in the Postgres log file after the crash:
ERROR: xlog flush request is not satisfied -- seen for a couple of tables;
we ran VACUUM FULL on those tables and the error went away after that.
ERROR: right sibling's left-link doesn't match: block 273660 links to
273500 instead of expected 273661 in index -- observed while running
VACUUM FREEZE on the database; we dropped this index and created a
new one.
Observations:
The VACUUM FREEZE ANALYZE job, which is started through a cron job, gets
stuck on the database side. pg_cancel_backend() and pg_terminate_backend()
are not able to terminate the stuck processes; only restarting the database
clears them. I think this is happening due to corruption (if that is true,
how can I detect it? pg_dump?). Is there any way to overcome this problem?
Would migrating the database to a new instance (pg_basebackup and switching
over to the new instance) solve this issue?
Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
number of acceptable connections is identical to the number of allowed
child processes; it needs to be less, by the number of background
processes we want to support. But it seems like a darn hard-to-hit bug,
so I'm not quite sure that that explains your observation.
bhargav kamineni <bhargavpostgres@gmail.com> writes:
So ... how many is "a bunch"?
more than 85
Hm. That doesn't seem like it'd be enough to trigger the problem;
you'd need about max_connections excess connections (that are shortly
going to be rejected) to run into this problem, and you said you
had max_connections = 500. Maybe several different clients were all
doing this at once?
But anyway, AFAICS there is only one code path that could lead to the
reported error message, so one way or another you got there. I've
pushed a fix for this, which will be in next month's releases.
The errors below were observed in the Postgres log file after the crash:
ERROR: xlog flush request is not satisfied -- seen for a couple of tables;
we ran VACUUM FULL on those tables and the error went away after that.
ERROR: right sibling's left-link doesn't match: block 273660 links to
273500 instead of expected 273661 in index -- observed while running
VACUUM FREEZE on the database; we dropped this index and created a
new one.
That seems unrelated. A postmaster crash shouldn't have any
data-corruption consequences, since it never touches any
relation files directly.
regards, tom lane
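As a closing illustration of the point made earlier in the thread, that the number of acceptable connections should be smaller than the number of child slots, here is a toy admission check. It is a sketch of the idea only and makes no claim about how the pushed fix is implemented; `can_accept_connection`, `slots_left_after_flood` and the reserve of 8 are hypothetical.

```python
def can_accept_connection(live_children, child_slot_limit,
                          reserved_for_background=0):
    # Admit a new connection only while doing so still leaves the reserved
    # number of slots free for background processes.
    return live_children < child_slot_limit - reserved_for_background


def slots_left_after_flood(child_slot_limit, reserved_for_background):
    # Keep admitting connections until the check refuses, then report how
    # many child slots remain for anything else (e.g. parallel workers).
    live = 0
    while can_accept_connection(live, child_slot_limit,
                                reserved_for_background):
        live += 1
    return child_slot_limit - live


no_reserve = slots_left_after_flood(1024, 0)    # connections take everything
with_reserve = slots_left_after_flood(1024, 8)  # 8 slots stay available
print(no_reserve, with_reserve)
```

With no reserve, a flood of connection attempts can consume every slot, so a later request for a worker process finds none free; with a reserve, the flood is refused earlier and the workers still have room.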