More postmaster troubles
Hello again,
Thanks again to those who pointed me to the semaphore problem. I,
unfortunately, have another problem:
Solaris 7 on a SPARC 20 running PostgreSQL 6.4.2. Occasionally (once or
twice a day), under a very light load, brain-dead child processes begin
to accumulate on my system. If left unchecked, the parent process
eventually runs out of resources and dies, orphaning all of the lost
processes. (Now that I have solved the semaphore error, the resource it
exhausts appears to be the backend limit of 64 processes.)
Here is a snapshot of truss on some of the processes:
# truss -p 5879
semop(259915776, 0xEFFFC560, 1) (sleeping...)
# truss -p 5912
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5915
semop(259915776, 0xEFFFC190, 1) (sleeping...)
# truss -p 5931
semop(259915776, 0xEFFFC280, 1) (sleeping...)
# truss -p 5926
semop(259915776, 0xEFFFC280, 1) (sleeping...)
They all appear to be waiting on a semaphore operation that apparently
never completes. The number of stalled processes grows rapidly (it has
gone from 12 to 21 while I wrote this e-mail).
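For anyone curious what that stuck call looks like, here is a tiny
standalone program (purely illustrative, not taken from the PostgreSQL
source): a semop() with sem_op = -1 and no IPC_NOWAIT blocks until some
other process increments the semaphore, and if that increment never
comes the caller sleeps in semop() forever, exactly like the backends
above.

/* Minimal sketch, not PostgreSQL code: a System V semaphore wait that
 * never returns because nobody ever performs the matching increment. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    int semid;
    struct sembuf op;

    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
    {
        perror("semget");
        return 1;
    }

    op.sem_num = 0;     /* first (and only) semaphore in the set       */
    op.sem_op = -1;     /* "P" operation: wait until the count is > 0  */
    op.sem_flg = 0;     /* no IPC_NOWAIT, so the call blocks           */

    /* Nothing will ever increment this semaphore, so truss would show
     * this process sitting in semop(...) (sleeping...) indefinitely.
     * (Remember to remove the set with ipcrm afterwards.)             */
    if (semop(semid, &op, 1) < 0)
        perror("semop");

    return 0;
}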
The stalled processes all started between 6:57am PST and 7:18am PST.
Here is what postmaster wrote to the log:
Feb 12 06:56:46 constantinople POSTMASTER: FATAL: pq_putnchar:
fputc() failed: errno=32
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock
detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error
on wakeup - Aborting this transaction
Feb 12 06:57:42 constantinople POSTMASTER: NOTICE: Deadlock
detected -- See the lock(l) manual page for a possible cause.
Feb 12 06:57:42 constantinople POSTMASTER: ERROR: WaitOnLock: error
on wakeup - Aborting this transaction
Feb 12 07:02:18 constantinople POSTMASTER: FATAL: pq_putnchar:
fputc() failed: errno=32
Feb 12 07:02:19 constantinople last message repeated 2 times
Most of the time, things just work, but it appears that once
something has gone awry, I experience a spiraling death.
Thoughts? Suggestions? Help? :)
DwD
--
Daryl W. Dunbar
http://www.com, Where the Web Begins!
mailto:daryl@www.com
> Solaris 7 on a SPARC 20 running PostgreSQL 6.4.2. Occasionally (once or
> twice a day), under a very light load, brain-dead child processes begin
> to accumulate on my system. If left unchecked, the parent process
> eventually runs out of resources and dies, orphaning all of the lost
> processes. (Now that I have solved the semaphore error, the resource it
> exhausts appears to be the backend limit of 64 processes.)
Have you installed the following patch? It solves the problem when the
number of backends reaches MaxBackendId. I'm not sure whether your
problem is related to this, though.
-------------------------------- cut here ---------------------------
*** postgresql-6.4.2/src/backend/postmaster/postmaster.c.orig Sun Nov 29 10:52:32 1998
--- postgresql-6.4.2/src/backend/postmaster/postmaster.c Sat Jan 9 18:14:52 1999
***************
*** 238,243 ****
--- 238,244 ----
static long PostmasterRandom(void);
static void RandomSalt(char *salt);
static void SignalChildren(SIGNAL_ARGS);
+ static int CountChildren(void);
#ifdef CYR_RECODE
void GetCharSetByHost(char *, int, char *);
***************
*** 754,764 ****
* by the backend.
*/
! if (BackendStartup(port) != STATUS_OK)
! PacketSendError(&port->pktInfo,
"Backend startup failed");
! else
! status = STATUS_ERROR;
}
/* Close the connection if required. */
--- 755,771 ----
* by the backend.
*/
! if (CountChildren() < MaxBackendId) {
! if (BackendStartup(port) != STATUS_OK)
! PacketSendError(&port->pktInfo,
"Backend startup failed");
! else {
! status = STATUS_ERROR;
! }
! } else {
! PacketSendError(&port->pktInfo,
! "There are too many backends");
! }
}
/* Close the connection if required. */
***************
*** 1617,1620 ****
--- 1624,1655 ----
}
return random() ^ random_seed;
+ }
+
+ /*
+ * Count up the number of child processes.
+ */
+ static int
+ CountChildren(void)
+ {
+ Dlelem *curr,
+ *next;
+ Backend *bp;
+ int mypid = getpid();
+ int cnt = 0;
+
+ curr = DLGetHead(BackendList);
+ while (curr)
+ {
+ next = DLGetSucc(curr);
+ bp = (Backend *) DLE_VAL(curr);
+
+ if (bp->pid != mypid)
+ {
+ cnt++;
+ }
+
+ curr = next;
+ }
+ return(cnt);
}
Thank you, Tatsuo-san. This patch will solve the dying-process
problem when I reach MaxBackendId (which I increased from 64 to
128). However, I do not know what is causing the spiraling death of
the processes in the first place. :(
Is there some place I should be looking for other patches, besides
those listed on www.postgresql.org?
Thank you for your continued help.
DwD
"Daryl W. Dunbar" <daryl@www.com> writes:
Thank you Tatsousan. This patch will solve the dying process
problem when I reach MaxBackendId (which I increased from 64 to
128). However, I do not know what is causing the spiraling death of
the processes in the first place. :(
Hmm. I have noticed at least one place in the code where there is an
undocumented hard-wired dependency on MaxBackendId, to wit MAX_PROC_SEMS
in include/storage/proc.h which is set at 128. Presumably it should be
equal to MaxBackendId (and I intend to fix that soon). Evidently that
particular bug is not hurting you (yet) but perhaps there are similar
errors elsewhere that kick in sooner. Do you see the spiraling-death
problem if you run with MaxBackendId at its customary value of 64?
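Purely as a sketch (these are stand-in defines, not the actual 6.4.2
headers), the sort of thing I mean is that the semaphore count ought to
be derived from the backend limit, or at the very least the build
should fail loudly when the two disagree:

/* Illustrative only -- stand-ins for the real MaxBackendId and the
 * MAX_PROC_SEMS in include/storage/proc.h.                            */
#define MaxBackendId   64       /* stand-in for the real definition    */
#define MAX_PROC_SEMS  128      /* currently hard-wired in proc.h      */

#if MaxBackendId > MAX_PROC_SEMS
#error "MaxBackendId exceeds MAX_PROC_SEMS; raise MAX_PROC_SEMS as well"
#endif

With MaxBackendId raised to 128 the numbers still happen to match, but
anything beyond that would silently outgrow the semaphore arrays.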
The log extract you posted before mentions "fputc() failed: errno=32"
which suggests an unexpected client disconnect during a transaction.
I suspect the backend that gets that disconnect is failing to clean up
properly before exiting, and is leaving one or more locks locked.
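(errno 32 on this platform is EPIPE.) The little program below is not
the backend's pq_putnchar, just an illustration of the failure mode:
with SIGPIPE ignored, writing to a socket the client has already closed
fails with errno 32, and whatever exit path runs afterwards has to
release any locks the backend holds, or every other backend that wants
them ends up sleeping in semop().

/* Sketch only, not PostgreSQL source.  Reproduces "fputc() failed:
 * errno=32": with SIGPIPE ignored, a write to a socket whose peer has
 * closed fails with EPIPE.  The interesting part is the cleanup hook --
 * skip it on this path and the locks stay held.                        */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the backend's real abort/lock-release path */
static void abort_transaction_and_release_locks(void)
{
    fprintf(stderr, "cleanup: aborting transaction, releasing locks\n");
}

int main(void)
{
    int sv[2];
    FILE *conn;

    signal(SIGPIPE, SIG_IGN);               /* as the backend does       */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;
    close(sv[1]);                           /* the "client" disconnects  */

    conn = fdopen(sv[0], "w");
    if (fputc('X', conn) == EOF || fflush(conn) == EOF)
    {
        fprintf(stderr, "fputc failed: errno=%d (%s)\n",
                errno, strerror(errno));    /* prints errno=32, EPIPE    */
        if (errno == EPIPE)
            abort_transaction_and_release_locks();
    }
    fclose(conn);
    return 0;
}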
We don't have enough info yet to track down the cause, but I suggest
we could narrow it down some by seeing whether the problem goes away
with a lower MaxBackendId setting.
(You might also want to work on making your clients more robust,
but I'd like to see if we can solve the backend bug first ...)
regards, tom lane
Tom,
I have to date experienced the problem only with MaxBackendId set to
64. Today I installed a version of the code with it set to 128 (I
just picked that number more or less at random, but I would like to
get it higher). By the way, I had to tune the kernel to allow me to
increase MaxBackendId, this time in shared memory (SHMMAX).
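For the record, the /etc/system entries involved look roughly like the
following; the values are illustrative only, not a recommendation, and
a reboot is required before they take effect:

* /etc/system -- illustrative values only
* maximum size of a shared memory segment
set shmsys:shminfo_shmmax=0x4000000
* semaphore limits: number of sets, total semaphores, semaphores per set
set semsys:seminfo_semmni=64
set semsys:seminfo_semmns=512
set semsys:seminfo_semmsl=64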
As for the clients, they are web users via mod_perl/DBI/DBD::Pg. It
is possible that a user is hitting the stop button at just the wrong
moment, which hangs the connection (backend), but I have been unable
to reproduce that so far. That was my first thought on this problem.
The fact that it apparently spirals is disturbing; I highly doubt
there is a user out there hitting the stop key 64 times in a row. :)
Thanks for your help,
DwD
Is there any documentation on database recovery? I am new to Postgres. I
really like what I see. I have written a membership database application
for an NPO in New Mexico. It is written in Access 97. I want to migrate to
a real database engine, and I am strongly considering Postgres. I have
several questions:
1) Is transaction logging available, how does it work, and how do I set
it up? I have been through most of the on-line documentation several times
(available on the internet) and have not found anything that talks about
transaction logging.
2) Database recovery. If I make a backup at 10:00am and the database
goes south at 1:00pm, can I restore back to 10:00am and automatically
re-post any/all transactions that occurred between 10:00am and 1:00pm
without requiring the users to re-enter all their data?
3) Is anyone working on mirroring or shadowing? I would like to be able
to have a backup database engine on a second back-up server get updated
automatically soon after an update is posted to the main server.
Thanks, Michael