Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

Started by Robert Haas · almost 12 years ago · 29 messages · pgsql-hackers
#1Robert Haas
robertmhaas@gmail.com

On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:

Hamid@EDB; Can you please have someone configure anole to build git
head as well as the other branches? Thanks.

The test_shm_mq regression tests hung on this machine this morning.
Hamid was able to give me access to log in and troubleshoot.
Unfortunately, I wasn't able to completely track down the problem
before accidentally killing off the running cluster, but it looks like
test_shm_mq_pipelined() tried to start 3 background workers and the
postmaster only ever launched one of them, so the test just sat there
and waited for the other two workers to start. At this point, I have
no idea what could cause the postmaster to be asleep at the switch
like this, but it seems clear that's what happened.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)

Robert Haas <robertmhaas@gmail.com> writes:

The test_shm_mq regression tests hung on this machine this morning.

It looks like hamster may have a repeatable issue there as well,
since the last set of DSM code changes.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)

I wrote:

It looks like hamster may have a repeatable issue there as well,
since the last set of DSM code changes.

Ah, scratch that --- on closer inspection it looks like both failures
probably trace to out-of-disk-space.

regards, tom lane

#4Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#2)

On Sat, May 10, 2014 at 6:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The test_shm_mq regression tests hung on this machine this morning.

It looks like hamster may have a repeatable issue there as well,
since the last set of DSM code changes.

Yeah, this node has a limited amount of space available as it runs
only with a 4GB flash card... I just freed up ~200MB of cached
packages on it; let's hope the out-of-space errors become less
frequent when building on a branch. What is interesting, btw, is that it
only happens for a couple of contrib tests (pgcrypto, test_shm_mq),
and only on the master branch.
--
Michael

#5Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#1)

On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:

Hamid@EDB; Can you please have someone configure anole to build git
head as well as the other branches? Thanks.

The test_shm_mq regression tests hung on this machine this morning.
Hamid was able to give me access to log in and troubleshoot.
Unfortunately, I wasn't able to completely track down the problem
before accidentally killing off the running cluster, but it looks like
test_shm_mq_pipelined() tried to start 3 background workers and the
postmaster only ever launched one of them, so the test just sat there
and waited for the other two workers to start. At this point, I have
no idea what could cause the postmaster to be asleep at the switch
like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop(). I think that's
just where it was when the backend that wanted to run test_shm_mq
requested that a few background workers get launched. Each
registration would have sent the postmaster a separate SIGUSR1, but
for some reason the postmaster only received one, which I think is
legit behavior, though possibly not typical on modern Linux systems.
When the SIGUSR1 arrived, the postmaster jumped into
sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
which launched the first background worker. Then it returned, and the
arrival of the signal did NOT interrupt the pending select().

This chain of events can't occur if an arriving SIGUSR1 causes
select() to return EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
Some combination of those behaviors likely explains why this doesn't
occur on other machines.

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#5)

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:

Hamid@EDB; Can you please have someone configure anole to build git
head as well as the other branches? Thanks.

The test_shm_mq regression tests hung on this machine this morning.
Hamid was able to give me access to log in and troubleshoot.
Unfortunately, I wasn't able to completely track down the problem
before accidentally killing off the running cluster, but it looks like
test_shm_mq_pipelined() tried to start 3 background workers and the
postmaster only ever launched one of them, so the test just sat there
and waited for the other two workers to start. At this point, I have
no idea what could cause the postmaster to be asleep at the switch
like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop(). I think that's
just where it was when the backend that wanted to run test_shm_mq
requested that a few background workers get launched. Each
registration would have sent the postmaster a separate SIGUSR1, but
for some reason the postmaster only received one, which I think is
legit behavior, though possibly not typical on modern Linux systems.
When the SIGUSR1 arrived, the postmaster jumped into
sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
which launched the first background worker. Then it returned, and the
arrival of the signal did NOT interrupt the pending select().

This chain of events can't occur if an arriving SIGUSR1 causes
select() to return EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
Some combination of those behaviors likely explains why this doesn't
occur on other machines.

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or after 60 seconds have passed? That's not to say
this isn't wrong, but still?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#6)

On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop(). I think that's
just where it was when the backend that wanted to run test_shm_mq
requested that a few background workers get launched. Each
registration would have sent the postmaster a separate SIGUSR1, but
for some reason the postmaster only received one, which I think is
legit behavior, though possibly not typical on modern Linux systems.
When the SIGUSR1 arrived, the postmaster jumped into
sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
which launched the first background worker. Then it returned, and the
arrival of the signal did NOT interrupt the pending select().

This chain of events can't occur if an arriving SIGUSR1 causes
select() to return EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
Some combination of those behaviors likely explains why this doesn't
occur on other machines.

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or after 60 seconds have passed? That's not to say
this isn't wrong, but still?

There aren't going to be any new connections arriving while the
contrib regression tests run, I believe, so I don't think there is an
escape hatch there. I didn't think to check how the timeout was set in
ServerLoop, and it does look like the maximum ought to be 60 seconds,
so either there's some other ingredient I'm missing here, or the whole
theory is just wrong altogether. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)

On 2014-09-29 15:24:55 -0400, Robert Haas wrote:

On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or after 60 seconds have passed? That's not to say
this isn't wrong, but still?

There aren't going to be any new connections arriving while the
contrib regression tests run, I believe, so I don't think there is an
escape hatch there.

I thought you might have tried to connect... and I guessed you'd have
reported it if that had fixed things.

I didn't think to check how timeout was set in
ServerLoop, and it does look like the maximum ought to be 60 seconds,
so either there's some other ingredient I'm missing here, or the whole
theory is just wrong altogether. :-(

Yea :(. Note how signals are blocked in all the signal handlers and only
unblocked for a very short time (the sleep).

(stare at random shit for far too long)

Ah. DetermineSleepTime(), which is called while signals are unblocked(!),
modifies BackgroundWorkerList. Previously that only iterated the list,
without modifying it. That's already of quite debatable safety, but
modifying it without having blocked signals is most definitely
broken. The modification was introduced by 7f7485a0c...

If you can manually run stuff on that machine, it'd be rather helpful if
you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
HaveCrashedWorker() loop.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#5)

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:

Hamid@EDB; Can you please have someone configure anole to build git
head as well as the other branches? Thanks.

The test_shm_mq regression tests hung on this machine this morning.
Hamid was able to give me access to log in and troubleshoot.
Unfortunately, I wasn't able to completely track down the problem
before accidentally killing off the running cluster, but it looks like
test_shm_mq_pipelined() tried to start 3 background workers and the
postmaster only ever launched one of them, so the test just sat there
and waited for the other two workers to start. At this point, I have
no idea what could cause the postmaster to be asleep at the switch
like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Not necessarily related, but one interesting tidbit is that fork isn't
mentioned to be async signal safe on HP-UX:
http://nixdoc.net/man-pages/HP-UX/man5/thread_safety.5.html#Async%20Signal%20Safe

I have some doubts that fork() could really be non-signal-safe, but it's
a bit odd. IIRC, POSIX requires fork() to be async-signal-safe, at least
if threads aren't present.

I'm generally baffled at all the stuff postmaster does in signal
handlers... ProcessConfigFile(), load_hba() et al. It's all done with
signals disabled, but still.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)

On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Yea :(. Note how signals are blocked in all the signal handlers and only
unblocked for a very short time (the sleep).

(stare at random shit for far too long)

Ah. DetermineSleepTime(), which is called while signals are unblocked(!),
modifies BackgroundWorkerList. Previously that only iterated the list,
without modifying it. That's already of quite debatable safety, but
modifying it without having blocked signals is most definitely
broken. The modification was introduced by 7f7485a0c...

Ouch. OK, yeah, that's a bug.

If you can manually run stuff on that machine, it'd be rather helpful if
you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
HaveCrashedWorker() loop.

I'd do it the other way around, and adjust ServerLoop to put the
PG_SETMASK calls right around pg_usleep() and select(). But why futz
with anole? Let's just check in the fix. It'll either fix anole or
not, but we should fix the bug you found either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Ah. DetermineSleepTime(), which is called while signals are unblocked(!),
modifies BackgroundWorkerList. Previously that only iterated the list,
without modifying it. That's already of quite debatable safety, but
modifying it without having blocked signals is most definitely
broken. The modification was introduced by 7f7485a0c...

Ouch. OK, yeah, that's a bug.

Yeah. Can we just postpone the signal unblock till after that function?

regards, tom lane

#12Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#10)

On 2014-09-29 16:16:24 -0400, Robert Haas wrote:

If you can manually run stuff on that machine, it'd be rather helpful if
you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
HaveCrashedWorker() loop.

I'd do it the other way around, and adjust ServerLoop to put the
PG_SETMASK calls right around pg_usleep() and select().

Sounds good.

But why futz with anole?

Shorter feedback cycles. Anole doesn't seem to run very often, and it
takes logging in just to see whether it's slow or hanging...

Let's just check in the fix. It'll either fix anole or not, but we
should fix the bug you found either way.

Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck
deep in another bug right now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)

On Mon, Sep 29, 2014 at 4:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Let's just check in the fix. It'll either fix anole or not, but we
should fix the bug you found either way.

Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck
deep in another bug right now.

I probably can't do it until Wednesday, but I'll do it then if you
can't get to it first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#9)

Andres Freund wrote:

I'm generally baffled at all the stuff postmaster does in signal
handlers... ProcessConfigFile(), load_hba() et al. It's all done with
signals disabled, but still.

As far as I recall, the rationale for why this is acceptable is that the
whole of postmaster is run with signals blocked; they are only unblocked
during the sleeping select().

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#14)

On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote:

Andres Freund wrote:

I'm generally baffled at all the stuff postmaster does in signal
handlers... ProcessConfigFile(), load_hba() et al. It's all done with
signals disabled, but still.

As far as I recall, the rationale for why this is acceptable is that the
whole of postmaster is run with signals blocked; they are only unblocked
during the sleeping select().

Yea, I wrote that above :). It still seems remarkably fragile and
unnecessarily complex. The whole thing would be much simpler and,
importantly, easier to understand if everything were done inside the
main loop and the handlers just set a latch...
But I guess that'd be a bit of a large change to something as central
as postmaster's code.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#15)

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote:

As far as I recall, the rationale for why this is acceptable is that the
whole of postmaster is run with signals blocked; they are only unblocked
during the sleeping select().

Yea, I wrote that above :). It still seems remarkably fragile and
unnecessarily complex. The whole thing would be much simpler and,
importantly, easier to understand if everything were done inside the
main loop and the handlers just set a latch...

Actually, I rather doubt that it would be either simpler or easier to
understand. The reason to think about changing it, IMO, is the fear that
sooner or later we're going to file a bug against some platform's libc and
they're going to tell us to get lost because POSIX says that such-and-such
a library call isn't supported inside a signal handler.

But I guess that'd be a bit of large change to something as central as
postmaster's code..

Yeah. It's a bit scary to consider changing this just to head off a
hypothetical portability problem, especially of a class that we've not
*actually* tripped across in nigh twenty years. The closest thing
I've seen to that is the valgrind bug we hit a while back,
https://bugzilla.redhat.com/show_bug.cgi?id=1024162

regards, tom lane

#17Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#5)

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

This happened again, and I investigated further.

Uh. Interestingly, anole just succeeded twice:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE

I plan to commit the mask/unmask patch regardless, but it's curious. The
first of the two builds could have been you 'unsticking' it by manually
mucking around. Did you also do that for the second build?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#18Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)

On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

This happened again, and I investigated further.

Uh. Interestingly, anole just succeeded twice:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE

I plan to commit the mask/unmask patch regardless, but it's curious. The
first of the two builds could have been you 'unsticking' it by manually
mucking around. Did you also do that for the second build?

No, but I think the failures have always been intermittent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#18)

On 2014-10-01 10:45:13 -0400, Robert Haas wrote:

On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

This happened again, and I investigated further.

Uh. Interestingly, anole just succeeded twice:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE

I plan to commit the mask/unmask patch regardless, but it's curious. The
first of the two builds could have been you 'unsticking' it by manually
mucking around. Did you also do that for the second build?

No, but I think the failures have always been intermittent.

There's no record of any relevantly failing builds on 9.4:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
and none from master either:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Is it set up for master now? Because it has reported back for 9.4 twice,
but never for master.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#19)

On Wed, Oct 1, 2014 at 10:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-01 10:45:13 -0400, Robert Haas wrote:

On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-29 14:46:20 -0400, Robert Haas wrote:

This happened again, and I investigated further.

Uh. Interestingly, anole just succeeded twice:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE

I plan to commit the mask/unmask patch regardless, but it's curious. The
first of the two builds could have been you 'unsticking' it by manually
mucking around. Did you also do that for the second build?

No, but I think the failures have always been intermittent.

There's no record of any relevantly failing builds on 9.4:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
and none from master either:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Is it set up for master now? Because it has reported back for 9.4 twice,
but never for master.

As far as I can tell, it's configured to run everything. I just
checked, though, and found it wedged again. I'm not sure whether it
was the same problem, though; I ended up just killing all of the
postgres processes to fix it. We may be just at the beginning of an
exciting debugging journey.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
#24Christoph Berg
myon@debian.org
In reply to: Andres Freund (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Christoph Berg (#24)
#26Christoph Berg
myon@debian.org
In reply to: Robert Haas (#25)
#27Christoph Berg
myon@debian.org
In reply to: Christoph Berg (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Christoph Berg (#27)
#29Christoph Berg
myon@debian.org
In reply to: Robert Haas (#28)