Major bug, possible, with Solaris 7?

Started by The Hermit Hacker, almost 27 years ago (8 messages)
#1 The Hermit Hacker
scrappy@hub.org

Can someone please take a minute to look at this?

I've gzip'd and moved his errorlog to
ftp.postgresql.org:/pub/debugging...one thing that appears to be
lacking...what version of PostgreSQL are you using?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

---------- Forwarded message ----------
Date: Thu, 18 Feb 1999 18:23:25 -0500
From: Daryl W. Dunbar <daryl@www.com>
To: The Hermit Hacker <scrappy@hub.org>
Subject: RE: Interested?

Thanks Marc, We exchanged an e-mail or two last week, along with
Tatsuo Ishii and Tom Lane. You suggested I truss the process.

Anyway, periodically, the backends spiral out of control with hung
up children until I hit MaxBackendID (which I compiled in to be
128). Initially, I was running out of semaphores on Solaris 7 and
changed /etc/system to add these lines:
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32

I increased shared memory so I could start more backends...
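
As a rough cross-check of the limits above, here is a minimal Python sketch that parses the `set` lines and compares the system-wide semaphore total (`semmns`) with the number of semaphore sets (`semmni`) times the maximum semaphores per set (`semmsl`). The names `ETC_SYSTEM` and `parse_etc_system` are made up for this illustration, and the sketch says nothing about how many semaphores PostgreSQL itself needs; it only checks that the quoted values are consistent with each other.

```python
import re

# The settings quoted above, as they would appear in /etc/system.
ETC_SYSTEM = """\
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32
"""


def parse_etc_system(text):
    """Return {variable: int} for every 'set module:variable=value' line."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"\s*set\s+\w+:(\w+)\s*=\s*(\d+)\s*$", line)
        if match:  # lines starting with '*' are comments and do not match
            settings[match.group(1)] = int(match.group(2))
    return settings


limits = parse_etc_system(ETC_SYSTEM)

# Upper bound implied by "number of sets" times "semaphores per set".
sets_times_size = limits["seminfo_semmni"] * limits["seminfo_semmsl"]

print("shmmax in MiB:", limits["shminfo_shmmax"] // (1024 * 1024))
print("semmni * semmsl:", sets_times_size)
print("semmns:", limits["seminfo_semmns"])
```

With these values the two semaphore figures agree (128 sets of 64 semaphores each is 8192), so neither limit is tighter than the other.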

OK, so now, everything is running fine and boom, the backends start
to hang on semop, eventually reaching MaxBackendID and refusing
connections.
Attached is a log file from a hang up today. Debug is set to 3.
All times are PST. I have carved out a bunch of normal operation
from the beginning (about 21,000 lines) and redundant 'too many
backends' (about 1,000 lines, while I was eating lunch :) signified
by {SNIP SNIP}. I pick the log back up with the birth of pid 2828
and left several 'normal' cycles in until...

You can see that process 2840 is the first child to hang. It was
started at 11:39:23 and did not die until sent a 15 by the parent at
14:12:16. All of the hung processes fall between 2840 and 3454.
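
For a sense of scale, the two timestamps above can be subtracted directly. This small sketch assumes both times fall on the same day and in the same time zone; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps quoted above for pid 2840.
started = datetime.strptime("11:39:23", FMT)
killed = datetime.strptime("14:12:16", FMT)

hung_for = killed - started  # a datetime.timedelta
print("pid 2840 was alive for", hung_for)
print("in seconds:", int(hung_for.total_seconds()))
```

That works out to roughly two and a half hours between startup and the kill signal.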

Sorry the file is so big. Here are some 'keys' you can use:
Startup is the first line (obviously).
You can find child startup by looking for [2840] (pid in brackets)
You can find child exits by looking for '2840 exited'
You can find where I send the kill signal by looking for 'pmdie 15'
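
The keys above suggest a simple way to list backends that started but never exited. The following is only a sketch: the function name `find_hung_backends` and the sample log lines are hypothetical, and the two regular expressions assume nothing more than what the keys describe (a pid in brackets for activity, and `<pid> exited` for a child exit). The real log format may need different patterns.

```python
import re


def find_hung_backends(log_lines):
    """Return sorted pids that appear as [pid] but never as '<pid> exited'."""
    started, exited = set(), set()
    for line in log_lines:
        # Assumed format: backend lines carry the pid in brackets, e.g. "[2840]".
        match = re.search(r"\[(\d+)\]", line)
        if match:
            started.add(int(match.group(1)))
        # Assumed format: the parent logs "<pid> exited" when a child goes away.
        match = re.search(r"\b(\d+) exited", line)
        if match:
            exited.add(int(match.group(1)))
    return sorted(started - exited)


# Hypothetical log excerpt, invented for illustration only.
sample_log = [
    "[2828] backend started",
    "[2840] backend started",
    "postmaster: 2828 exited with status 0",
    "[2841] backend started",
    "pmdie 15",
]

print(find_hung_backends(sample_log))
```

Run against the full log, the same idea would be expected to report the pids in the 2840 to 3454 range that never produced an exit line.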

I think that's a good start. :)

Don't hesitate to contact me if I can shed any more light. I'm wide
open to ideas at the moment. I'm in EST, but tend to work until
10-11 at night, so e-mail anytime.

Thanks,

DwD

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Thursday, February 18, 1999 5:36 PM
To: Daryl W. Dunbar
Subject: Re: Interested?

Hi Daryl...

I'm not the strongest at internal code, so may not be of any help
at all. I just went through my -hackers email, and can't seem to find
anything from you in there. Can you tell me what your problem is, as well
as version of PostgreSQL you are using, and we'll see what we can do?

Marc

On Thu, 18 Feb 1999, Daryl W. Dunbar wrote:

Marc,

I know that you put considerable volunteer time into PostgreSQL. If
I am not too bold in asking, and you are comfortable with it, I am
prepared to compensate you for your time if you can assist me in
tracking down this rather nasty bug I have been e-mailing Hackers
about. Please let me know if you are interested and if so, at what
rate.

We are in the process of launching a pretty exciting site and a
database is an integral part of it. I really want to use PostgreSQL,
but can not take it into production on Solaris with this problem
going on. I'm in the process of installing a test site on Linux to
see if the problem exists there, but I expect it is limited to
Solaris.

I anxiously await your response.

Thanks,

DwD

--
Daryl W. Dunbar
VP of Engineering/Chief Technology Officer
http://www.com, Where the Web Begins!
mailto:daryl@www.com

#2 Daryl W. Dunbar
daryl@www.com
In reply to: The Hermit Hacker (#1)
RE: [HACKERS] Major bug, possible, with Solaris 7?

Oh, sorry. 6.4.2 with a backend patch to prevent the parent death
in the event of MaxBackendID being reached.

I know it is in semop() because I did a truss on the child
processes. From a small sample, it looks like they may all be
trying to operate on the same semaphore. I'm recompiling with
the -g flag to gain more insight...

DwD

#3 The Hermit Hacker
scrappy@hub.org
In reply to: Daryl W. Dunbar (#2)
RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

Oh, sorry. 6.4.2 with a backend patch to prevent the parent death
in the event of MaxBackendID being reached.

I know it is in semop() because I did a truss on the child
processes. From a small sample, it looks like they may all be
trying to operate on the same semaphore. I'm recompiling with
the -g flag to gain more insight...

I'm just curious, but is this being used in production yet? If not, would
you be willing to try out the current snapshot, which is soon to become
6.5-BETA? If this apparent bug still exists there, I think it's enough of
a bug to prevent v6.5 coming out until this is fixed :( Then again,
something this reproducible will most likely hold up v6.4.3 from being
released also, so if we are planning a v6.4.3 (I thought we were), we'll
have to get this fixed in the 6.4 line also.

Actually, with that in mind, I'm putting together a very quick tarball of
what v6.4.3 is looking like so far. This is *not* a release, but I'd like
to see if this problem exists in the most current STABLE tree or not...I
know there have been quite a few fixes put into it...

Check in about a half hour or so, under the 'test' directory of
ftp.postgresql.org .. should be there then...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#4 Daryl W. Dunbar
daryl@www.com
In reply to: The Hermit Hacker (#3)
RE: [HACKERS] Major bug, possible, with Solaris 7?

At this point, I am willing to try anything. I'm in production (live
site), but we have not announced the site. What that means is that
I have the weekend to debug/fix/decide what to do. I'll take
whatever version you suggest and load it.

DwD

#5 The Hermit Hacker
scrappy@hub.org
In reply to: Daryl W. Dunbar (#4)
RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

At this point, I willing to try anything. I'm in production (live
site), but we have not announced the site. What that means is that
I have the weekend to debug/fix/decide what to do. I'll take
whatever version you suggest and load it.

Apologies for the delay...there is a copy of postgresql-6.4.3beta.tar.gz
available in the test directory...try that and please report back here...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#6 Daryl W. Dunbar
daryl@www.com
In reply to: The Hermit Hacker (#5)
2 attachment(s)
RE: [HACKERS] Major bug, possible, with Solaris 7?

OK. I'm running 6.4.3beta (after patching the code to compile -
patches attached). Now we wait to see if it breaks again...

DwD

Show quoted text

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Friday, February 19, 1999 11:48 PM
To: Daryl W. Dunbar
Cc: pgsql-hackers@postgreSQL.org
Subject: RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

At this point, I willing to try anything. I'm in

production (live

site), but we have not announced the site. What that

means is that

I have the weekend to debug/fix/decide what to do. I'll take
whatever version you suggest and load it.

Apologies for the delay...there is a copy of
postgresql-6.4.3beta.tar.gz
available in the test directory...try that and please
report back here...

DwD

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Friday, February 19, 1999 10:39 PM
To: Daryl W. Dunbar
Cc: pgsql-hackers@postgreSQL.org
Subject: RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

Oh, sorry. 6.4.2 with a backend patch to prevent the

parent death

in the event of MaxBackendID being reached.

I know it is in semop() because I did a truss on the child
processes. From a small sample, it looks like they

may all be

trying to operate on the same semaphore. I'm

recompiling with

the -g flag to gain more insight...

I'm just curious, but is this being used production yet?
If not, would
you be willing to try out the current snapshot, which is
soon to become
6.5-BETA? If this apparent bug still exists there, I
think its sufficient
a bug to prevent v6.5 coming out until this is fixed

then again,
something this reproducible will most likely hold up
v6.4.3 from being
released also, so if we are planning a v6.4.3 (I thought
we were), we'll
have to get this fixed in the 6.4 line also.

Actually, with that in mind, I'm putting together a very
quick tar ball of
what v6.4.3 is looking like so far. this is *not* a
release, but I'd like
to see if this problem exists in the most current STABLE
tree or not...I
know there has been quite a few fixes put into it...

Check in about a half hour or so, under the 'test'

directory of

ftp.postgresql.org .. should be there then...

-----Original Message-----
From: owner-pgsql-hackers@postgreSQL.org
[mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf

Of The Hermit

Hacker
Sent: Friday, February 19, 1999 12:46 PM
To: pgsql-hackers@postgreSQL.org
Cc: Daryl W. Dunbar
Subject: [HACKERS] Major bug, possible, with Solaris 7?

Can someone please take a minute to look at this?

I've gzip'd and moved his errorlog to
ftp.postgresql.org:/pub/debugging...one thing that

appears to be

lacking...what version of PostgreSQL are you using?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary:
scrappy@{freebsd|postgresql}.org

---------- Forwarded message ----------
Date: Thu, 18 Feb 1999 18:23:25 -0500
From: Daryl W. Dunbar <daryl@www.com>
To: The Hermit Hacker <scrappy@hub.org>
Subject: RE: Interested?

Thanks Marc, We exchanged an e-mail or two last

week, along with

Tatsuo Ishii and Tom Lane. You suggested I truss

the process.

Anyway, periodically, the backends spiral out of

control with hung

up children until I hit MaxBackendID (which I

compiled in to be

128). Initially, I was running out of semaphores on

Solaris 7 and

changed /etc/system to add these lines:
set shmsys:shminfo_shmmax=16777216
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=128
set shmsys:shminfo_shmseg=51
*
set semsys:seminfo_semmap=128
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=8192
set semsys:seminfo_semmnu=8192
set semsys:seminfo_semmsl=64
set semsys:seminfo_semopm=32
set semsys:seminfo_semume=32

I increased shared memory so I could start more backends...

OK, so now, everything is running fine and boom, the backends start
to hang on semop, eventually reaching MaxBackendID and refusing
connections.
Attached is a log file from a hang up today. Debug is set to 3.
All times are PST. I have carved out a bunch of normal operation
from the beginning (about 21,000 lines) and redundant 'too many
backends' (about 1,000 lines, while I was eating lunch :) signified
by {SNIP SNIP}. I pick the log back up with the birth of pid 2828
and left several 'normal' cycles in until...

You can see that process 2840 is the first child to hang. It was
started at 11:39:23 and did not die until sent a 15 by the parent at
14:12:16. All of the hung processes fall between 2840 and 3454.

Sorry the file is so big. Here are some 'keys' you can use:

Startup is the first line (obviously).
You can find child startup by looking for [2840] (pid in brackets)
You can find child exits by looking for '2480 exited'
You can find where I send the kill signal by looking for 'pmdie 15'

I think that's a good start. :)

Don't hesitate to contact me if I can shed any more light. I'm wide
open to ideas at the moment. I'm in EST, but tend to work until
10-11 at night, so e-mail anytime.

Thanks,

DwD

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Thursday, February 18, 1999 5:36 PM
To: Daryl W. Dunbar
Subject: Re: Interested?

Hi Daryl...

I'm not the strongest at internal code, so may not be of any help
at all. I just went through my -hackers email, and can't seem to find
anything from you in there. Can you tell me what your problem is, as well
as version of PostgreSQL you are using, and we'll see what we can do?

Marc

On Thu, 18 Feb 1999, Daryl W. Dunbar wrote:

Marc,

I know that you put considerable volunteer time into PostgreSQL. If
I am not too bold in asking, and you are comfortable with it, I am
prepared to compensate you for your time if you can assist me in
tracking down this rather nasty bug I have been e-mailing Hackers
about. Please let me know if you are interested and if so, at what
rate.

We are in the process of launching a pretty exciting site and a
database is an integral part of it. I really want to use PostgreSQL,
but can not take it into production on Solaris with this problem
going on. I'm in the process of installing a test site on Linux to
see if the problem exists there, but I expect it is limited to
Solaris.

I anxiously await your response.

Thanks,

DwD

--
Daryl W. Dunbar
VP of Engineering/Chief Technology Officer
http://www.com, Where the Web Begins!
mailto:daryl@www.com

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

Attachments:

datetime.pat (application/octet-stream)
*** datetime.c.orig	Sat Feb 20 08:00:43 1999
--- datetime.c	Sat Feb 20 08:00:22 1999
***************
*** 27,33 ****
  
  static int	date2tm(DateADT dateVal, int *tzp, struct tm * tm, double *fsec, char **tzn);
  
! #if 0
  static int	day_tab[2][12] = {
  	{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31},
  {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}};
--- 27,33 ----
  
  static int	date2tm(DateADT dateVal, int *tzp, struct tm * tm, double *fsec, char **tzn);
  
! #if 1
  static int	day_tab[2][12] = {
  	{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31},
  {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}};
dt.pat (application/octet-stream)
*** dt.c.orig	Sat Feb 20 08:00:54 1999
--- dt.c	Sat Feb 20 08:11:12 1999
***************
*** 55,61 ****
  #define USE_DATE_CACHE 1
  #define ROUND_ALL 0
  
! #if 0
  #define isleap(y) (((y % 4) == 0) && (((y % 100) != 0) || ((y % 400) == 0)))
  
  int			mdays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 0};
--- 55,61 ----
  #define USE_DATE_CACHE 1
  #define ROUND_ALL 0
  
! #if 1
  #define isleap(y) (((y % 4) == 0) && (((y % 100) != 0) || ((y % 400) == 0)))
  
  int			mdays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 0};
***************
*** 2302,2309 ****
   * These routines will be used by other date/time packages - tgl 97/02/25
   */
  
! #if 0
! XXX moved to dt.h - thomas 1999-01-15
  /* Set the minimum year to one greater than the year of the first valid day
   *	to avoid having to check year and day both. - tgl 97/05/08
   */
--- 2302,2310 ----
   * These routines will be used by other date/time packages - tgl 97/02/25
   */
  
! #if 1
! /*XXX moved to dt.h - thomas 1999-01-15
!  */
  /* Set the minimum year to one greater than the year of the first valid day
   *	to avoid having to check year and day both. - tgl 97/05/08
   */
#7Daryl W. Dunbar
daryl@www.com
In reply to: Daryl W. Dunbar (#6)
RE: [HACKERS] Major bug, possible, with Solaris 7?

Problem still exists in 6.4.3.

I am wondering, since gdb can not give me any information on the
location of my hang (I get lots of ??'s) and all I can see is
semsys(), am I spinning in a system library? Does anyone have
access to the Solaris7 patches? I see one kernel patch out there,
but I can not access the description, nor download the patch,
because it is not considered in the recommended or security list.
I'm talking to my rep on this on Monday!

For reference, I can provide a syslog and truss of the 6.4.3
failure, but I expect it looks just about like the 6.4.2 one.

Thanks,

DwD

-----Original Message-----
From: owner-pgsql-hackers@postgreSQL.org
[mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of
Daryl W. Dunbar
Sent: Saturday, February 20, 1999 11:26 AM
To: The Hermit Hacker
Cc: pgsql-hackers@postgreSQL.org
Subject: RE: [HACKERS] Major bug, possible, with Solaris 7?

OK. I'm running 6.4.3beta (after patching the code to compile -
patches attached). Now we wait to see if it breaks again...

DwD

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Friday, February 19, 1999 11:48 PM
To: Daryl W. Dunbar
Cc: pgsql-hackers@postgreSQL.org
Subject: RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

At this point, I'm willing to try anything. I'm in production (live
site), but we have not announced the site. What that means is that
I have the weekend to debug/fix/decide what to do. I'll take
whatever version you suggest and load it.

Apologies for the delay...there is a copy of postgresql-6.4.3beta.tar.gz
available in the test directory...try that and please report back here...

DwD

-----Original Message-----
From: The Hermit Hacker [mailto:scrappy@hub.org]
Sent: Friday, February 19, 1999 10:39 PM
To: Daryl W. Dunbar
Cc: pgsql-hackers@postgreSQL.org
Subject: RE: [HACKERS] Major bug, possible, with Solaris 7?

On Fri, 19 Feb 1999, Daryl W. Dunbar wrote:

Oh, sorry. 6.4.2 with a backend patch to prevent the parent death
in the event of MaxBackendID being reached.

I know it is in semop() because I did a truss on the child
processes. From a small sample, it looks like they may all be
trying to operate on the same semaphore. I'm recompiling with
the -g flag to gain more insight...

I'm just curious, but is this being used in production yet? If not, would
you be willing to try out the current snapshot, which is soon to become
6.5-BETA? If this apparent bug still exists there, I think it's sufficient
a bug to prevent v6.5 coming out until this is fixed. Then again,
something this reproducible will most likely hold up v6.4.3 from being
released also, so if we are planning a v6.4.3 (I thought we were), we'll
have to get this fixed in the 6.4 line also.

Actually, with that in mind, I'm putting together a very quick tar ball of
what v6.4.3 is looking like so far. This is *not* a release, but I'd like
to see if this problem exists in the most current STABLE tree or not...I
know there have been quite a few fixes put into it...

Check in about a half hour or so, under the 'test' directory of
ftp.postgresql.org .. should be there then...

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daryl W. Dunbar (#7)
Re: [HACKERS] Major bug, possible, with Solaris 7?

"Daryl W. Dunbar" <daryl@www.com> writes:

Problem still exists in 6.4.3.

I figured it probably would :-(.

As far as I can tell from your truss trace, the processes are going
to sleep via semop() and never being awoken. There's not much more
that we can find out at the kernel level, since the kernel can't tell
*why* a backend thinks it needs to go to sleep. Assuming that
TEST_AND_SET is defined in your compilation, the backends only use
one semaphore apiece and all blocking/awakening is done via the same
semaphore. We need to know what lock-manager condition is causing
each backend to decide to block and why the lock is not getting
released.

I was hoping that a gdb backtrace would tell us more --- it's bad that
you can't get any info that way. On my system (HPUX) gdb has a problem
with debugging shared libraries in a process that you attach to, as
opposed to starting fresh under gdb. I dunno if Solaris is similar, but
it might be worth building your -g version of the backend with no shared
libraries, everything linked statically (-static option, I think, when
linking the postgres binary). If your system doesn't have a static
version of libc then this won't help.

But probably the first thing to try at this point is adding a bunch of
debugging printouts. If you compile with -DLOCK_MGR_DEBUG (see
src/backend/storage/lmgr/lock.c) and turn on the trace-locks option then
you'll get a bunch more log output that should tell us something useful
about why the processes are deciding to block.

regards, tom lane