buildfarm failures on smew and anole

Started by Robert Haas about 12 years ago, 18 messages
#1 Robert Haas
robertmhaas@gmail.com

The build is continuing to fail on smew and anole. The reason it's
failing is because those machines are choosing max_connections = 10,
which is not enough to run the regression tests. I think this is
probably because of System V semaphore exhaustion. The machines are
not choosing a small value for shared_buffers - they're still picking
128MB - so the problem is not the operating system's shared memory
limit. But it might be that the operating system is short on some
other resource that prevents starting up with a more normal value for
max_connections. My best guess is System V semaphores; I think that
one of the failed runs caused by the dynamic shared memory patch
probably left a bunch of semaphores allocated, so the build will keep
failing until those are manually cleaned up.

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?
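A minimal sketch of the check being requested, assuming the util-linux column layout of `ipcs -s` (other platforms print a different layout). The function name, the `pgbuild` owner, and the sample listing are hypothetical; the script only prints the `ipcrm -s` commands an owner might then run by hand.

```python
def leftover_semaphores(ipcs_output, owner):
    """Return the semid of every semaphore array owned by `owner`."""
    semids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows have five columns: key, semid, owner, perms, nsems.
        # The key column starts with "0x"; header and blank lines do not.
        if len(fields) == 5 and fields[0].startswith("0x") and fields[2] == owner:
            semids.append(int(fields[1]))
    return semids


# Hypothetical listing in the util-linux `ipcs -s` layout (not real data).
SAMPLE = """\
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0052e2c1 98304      pgbuild    600        17
0x0052e2c2 131073     pgbuild    600        17
0x00000000 163842     root       600        1
"""

for semid in leftover_semaphores(SAMPLE, "pgbuild"):
    # Print the cleanup command instead of running it.
    print(f"ipcrm -s {semid}")
```

Removing an array is only safe when no server belonging to that user is still running, which is why the sketch prints commands rather than executing them.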

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#1)
Re: buildfarm failures on smew and anole

On 10/11/2013 03:33 PM, Robert Haas wrote:

The build is continuing to fail on smew and anole. The reason it's
failing is because those machines are choosing max_connections = 10,
which is not enough to run the regression tests. I think this is
probably because of System V semaphore exhaustion. The machines are
not choosing a small value for shared_buffers - they're still picking
128MB - so the problem is not the operating system's shared memory
limit. But it might be that the operating system is short on some
other resource that prevents starting up with a more normal value for
max_connections. My best guess is System V semaphores; I think that
one of the failed runs caused by the dynamic shared memory patch
probably left a bunch of semaphores allocated, so the build will keep
failing until those are manually cleaned up.

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

It is possible to set the buildfarm config

build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

cheers

andrew

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#2)
Re: buildfarm failures on smew and anole

On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

It is possible to set the buildfarm config

build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

Maybe I didn't explain that well. The problem is that the regression
tests require at least 20 connections to run, and those two machines
are currently auto-selecting 10 connections, so make check is failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#3)
Re: buildfarm failures on smew and anole

On 10/14/2013 09:12 AM, Robert Haas wrote:

On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

It is possible to set the buildfarm config

build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

Maybe I didn't explain that well. The problem is that the regression
tests require at least 20 connections to run, and those two machines
are currently auto-selecting 10 connections, so make check is failing.

Why do they need 20 connections? pg_regress has code in it to limit the
degree of parallelism of tests, and has done for years, specifically to
cater for buildfarm machines that are unable to handle the defaults.
Using this option in the buildfarm client config triggers use of this
feature.
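A minimal sketch of the kind of throttling described here, not pg_regress's actual code. It assumes that when the connection limit is smaller than a parallel group, the group is split into batches that run one after another. The function name and test names are hypothetical.

```python
def split_parallel_group(tests, max_connections=0):
    """Split one parallel group into batches that run one after another.

    max_connections <= 0 means "no limit": the whole group is one batch.
    """
    if max_connections <= 0 or max_connections >= len(tests):
        return [list(tests)]
    return [
        list(tests[i:i + max_connections])
        for i in range(0, len(tests), max_connections)
    ]


# A hypothetical group of twenty tests, limited to ten connections.
group = [f"test_{n:02d}" for n in range(20)]
batches = split_parallel_group(group, max_connections=10)
print([len(batch) for batch in batches])  # prints [10, 10]
```

With no limit supplied, all twenty tests start at once, which is the case that needs twenty connections.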

cheers

andrew

#5 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#3)
Re: buildfarm failures on smew and anole

On 2013-10-14 09:12:09 -0400, Robert Haas wrote:

On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

It is possible to set the buildfarm config

build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

Maybe I didn't explain that well. The problem is that the regression
tests require at least 20 connections to run, and those two machines
are currently auto-selecting 10 connections, so make check is failing.

I think pg_regress has support for spreading groups to fewer connections
if max_connections is set appropriately. I guess that's what Andrew is
referring to.

That said, I don't think that's the solution here. The machine clearly
worked with more connections until recently.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#4)
Re: buildfarm failures on smew and anole

On Mon, Oct 14, 2013 at 9:22 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

Maybe I didn't explain that well. The problem is that the regression
tests require at least 20 connections to run, and those two machines
are currently auto-selecting 10 connections, so make check is failing.

Why do they need 20 connections? pg_regress has code in it to limit the
degree of parallelism of tests, and has done for years, specifically to
cater for buildfarm machines that are unable to handle the defaults. Using
this option in the buildfarm client config triggers use of this feature.

Hmm, I wasn't aware of that. I thought they needed 20 connections
because parallel_schedule says:

# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.

If it's not supposed to matter how many connections are available,
then that comment is misleading. But I think it does matter, at least
in some situations, because otherwise these machines wouldn't be
failing with "sorry, too many clients already".

Anyway, as Andres said, the machines were working fine until recently,
so I think we just need to get them un-broken.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: buildfarm failures on smew and anole

On 2013-10-14 09:28:04 -0400, Robert Haas wrote:

# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.

If it's not supposed to matter how many connections are available,
then that comment is misleading. But I think it does matter, at least
in some situations, because otherwise these machines wouldn't be
failing with "sorry, too many clients already".

Well, you need to explicitly pass --max-connections to pg_regress.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: buildfarm failures on smew and anole

Robert Haas <robertmhaas@gmail.com> writes:

Anyway, as Andres said, the machines were working fine until recently,
so I think we just need to get them un-broken.

I think you're talking past each other. What would be useful here is
to find out *why* these machines are now failing, when they didn't before.
There might or might not be anything useful to be done about it, but if
we don't have that information, we can't tell.

regards, tom lane

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: buildfarm failures on smew and anole

On Mon, Oct 14, 2013 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Anyway, as Andres said, the machines were working fine until recently,
so I think we just need to get them un-broken.

I think you're talking past each other. What would be useful here is
to find out *why* these machines are now failing, when they didn't before.
There might or might not be anything useful to be done about it, but if
we don't have that information, we can't tell.

Well, my OP had a working theory which I think fits the facts, and
some suggested troubleshooting steps. How about that for a start?

The real problem here is that neither of the buildfarm owners has
responded to this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#1)
Re: buildfarm failures on smew and anole

On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#10)
Re: buildfarm failures on smew and anole

On Mon, Oct 14, 2013 at 4:29 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them? Or at least
reboot, to see if that unbreaks the build?

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#11)
Re: buildfarm failures on smew and anole

On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

I did it a few times over the weekend. At least twice less than 4 days
ago. There are currently no semaphores left around, so whatever
happened in the last run cleaned it up.

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#12)
Re: buildfarm failures on smew and anole

On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

I did it a few times over the weekend. At least twice less than 4 days
ago. There are currently no semaphores left around, so whatever
happened in the last run cleaned it up.

That seems to suggest I've introduced some bug. I'm at a loss as to
what it is, though. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#13)
Re: buildfarm failures on smew and anole

On 2013-10-16 08:39:10 -0400, Robert Haas wrote:

On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

I did it a few times over the weekend. At least twice less than 4 days
ago. There are currently no semaphores left around, so whatever
happened in the last run cleaned it up.

That seems to suggest I've introduced some bug. I'm at a loss as to
what it is, though. :-(

Ah. I see the issue. To reproduce do something like
# mkdir /tmp/empty
# mount --bind /tmp/empty /dev/shm/
and then run initdb.

The issue is that test_config_settings determines max_connections
without disabling dynamic shared memory which consequently chooses posix
which doesn't work. Setting it to none during the test makes it work.
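The failure mode can be sketched as follows. This is a simplified model and not initdb's actual code: it assumes the probe tries candidate values from largest to smallest, keeps the first one with which a trial server starts, and falls back to the smallest when every trial fails. The candidate list and both callbacks are assumptions for illustration.

```python
def choose_max_connections(candidates, server_starts):
    """Return the first candidate for which a trial server starts.

    If every trial fails, fall back to the last (smallest) candidate.
    """
    for value in candidates:
        if server_starts(value):
            return value
    return candidates[-1]


# Illustrative candidates, tried from largest to smallest.
CANDIDATES = (100, 50, 40, 30, 20, 10)


def healthy(max_connections):
    # Hypothetical machine where every trial server starts.
    return True


def broken_dsm(max_connections):
    # Hypothetical machine where the trial server cannot start at all,
    # for example because the selected dynamic shared memory type is
    # unusable. The failure has nothing to do with max_connections.
    return False


print(choose_max_connections(CANDIDATES, healthy))     # prints 100
print(choose_max_connections(CANDIDATES, broken_dsm))  # prints 10
```

Under this model a startup failure unrelated to connections still ends at the smallest value, which would match a machine settling on max_connections = 10.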

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#14)
Re: buildfarm failures on smew and anole

On Wed, Oct 16, 2013 at 8:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-10-16 08:39:10 -0400, Robert Haas wrote:

On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:

I cleaned the semaphores on smew, but they came back. Whatever is
crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

I did it a few times over the weekend. At least twice less than 4 days
ago. There are currently no semaphores left around, so whatever
happened in the last run cleaned it up.

That seems to suggest I've introduced some bug. I'm at a loss as to
what it is, though. :-(

Ah. I see the issue. To reproduce do something like
# mkdir /tmp/empty
# mount --bind /tmp/empty /dev/shm/
and then run initdb.

The issue is that test_config_settings determines max_connections
without disabling dynamic shared memory which consequently chooses posix
which doesn't work. Setting it to none during the test makes it work.

Gah. I fixed one instance of that problem in test_config_settings(),
but missed the other.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#15)
Re: buildfarm failures on smew and anole

On 2013-10-16 09:35:46 -0400, Robert Haas wrote:

Gah. I fixed one instance of that problem in test_config_settings(),
but missed the other.

Maybe it'd be better to default to none, just as max_connections
defaults to 1 and shared_buffers to 16? As we write out the value in the
config file, everything should still continue to work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#16)
Re: buildfarm failures on smew and anole

On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-10-16 09:35:46 -0400, Robert Haas wrote:

Gah. I fixed one instance of that problem in test_config_settings(),
but missed the other.

Maybe it'd be better to default to none, just as max_connections
defaults to 1 and shared_buffers to 16? As we write out the value in the
config file, everything should still continue to work.

Hmm, possibly. But how would we document that? It seems strange to
say that the default is none, but the actual setting probably won't be
none on your system because we hack up postgresql.conf.
shared_buffers pretty much just glosses over the distinction between
"default" and "what you probably have configured", but I'm not sure
that's actually great policy.

Trivial fix pushed, for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#17)
Re: buildfarm failures on smew and anole

On 2013-10-16 09:44:32 -0400, Robert Haas wrote:

On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-10-16 09:35:46 -0400, Robert Haas wrote:

Gah. I fixed one instance of that problem in test_config_settings(),
but missed the other.

Maybe it'd be better to default to none, just as max_connections
defaults to 1 and shared_buffers to 16? As we write out the value in the
config file, everything should still continue to work.

Hmm, possibly. But how would we document that? It seems strange to
say that the default is none, but the actual setting probably won't be
none on your system because we hack up postgresql.conf.
shared_buffers pretty much just glosses over the distinction between
"default" and "what you probably have configured", but I'm not sure
that's actually great policy.

I can't remember somebody actually being confused by that with s_b or
max_connections. So maybe it's just ok not to document it. But yes, I
can't come up with a succinct description of that behaviour either.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
