Tracing down buildfarm "postmaster does not shut down" failures
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe. An example is this current failure on hornet:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down
The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down.
Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
40 or so seconds on this machine. So it's possible that it was just so
overloaded that it took three times that much time on this run, but I'm
starting to think there may be more to it than that. We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.
The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG(). Normally the
only component of that that takes much time is buffer flushing, but
could something else be happening? Or perhaps the checkpoint code
has somehow not gotten the word to do its flushing at full speed?
What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.
Any objections? Anybody have another idea for data to collect?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe.
I noticed that.
An example is this current failure on hornet:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
FWIW you seem to have edited this URL -- it returns a blank page.
The right one is
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55&stg=stopdb-C-1
What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections? Anybody have another idea for data to collect?
Seems like a reasonable place to start. I assume you'll be installing
some debug-elog calls, enabled by default, and then once the problem is
fixed, we'll change the default to disabled but keep the actual calls?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting.
Seems like a reasonable place to start. I assume you'll be installing
some debug-elog calls, enabled by default, and then once the problem is
fixed, we'll change the default to disabled but keep the actual calls?
I had in mind to just "git revert" the patch when we're done with it.
There might be some parts we want to keep (for instance, I'm thinking of
logging "postmaster is done" after we've unlinked the pidfile, which might
be useful permanently). But I plan to err on the side of noisiness for
the moment, rather than make something I think has long-term value.
regards, tom lane
On 02/08/2016 02:15 PM, Tom Lane wrote:
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe. An example is this current failure on hornet:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down.

Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
40 or so seconds on this machine. So it's possible that it was just so
overloaded that it took three times that much time on this run, but I'm
starting to think there may be more to it than that. We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG(). Normally the
only component of that that takes much time is buffer flushing, but
could something else be happening? Or perhaps the checkpoint code
has somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections? Anybody have another idea for data to collect?
I think that's an excellent idea. This has been a minor running sore for
ages on slow machines.
cheers
andrew
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe.
We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.
It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.) I
agree the concentration on 9.5 and HEAD is suspicious; while those branches
get the most buildfarm runs, that factor by itself doesn't explain the
distribution of failures among versions.
What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections? Anybody have another idea for data to collect?
That's reasonable. If you would like higher-fidelity data, I can run loops of
"pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
that for HEAD and 9.2 simultaneously. A day of logs from that should show
clearly if HEAD is systematically worse than 9.2. By the way, you would
almost surely qualify for an account on this machine.
I had drafted the following message and patch last week, and I suppose it
belongs in this thread:
On Mon, Oct 12, 2015 at 06:41:06PM -0400, Tom Lane wrote:
I'm not sure if this will completely fix our problems with "pg_ctl start"
related buildfarm failures on very slow critters. It does get rid of the
hard wired 5-second timeout, but the 60-second timeout could still be an
issue. I think Noah was considering a patch to allow that number to be
raised. I'd be in favor of letting pg_ctl accept a default timeout length
from an environment variable, and then the slower critters could be fixed
by adjusting their buildfarm configurations.
Your commit 6bcce25 made src/bin test suites stop failing due to pg_ctl
startup timeouts, but other suites have been failing on the AIX buildfarm zoo
due to slow shutdown. Example taking 72s to even reach ShutdownXLOG():
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-15%2012%3A43%3A00
So, I wish to raise the timeout for those animals. Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
change behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope. (With said unintended consequences in mind, I
made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
the service created.)
Thanks,
nm
Attachments:
pgctl-timeout-envar-v1.patch (text/plain; charset=us-ascii, +23/-2)
Noah Misch <noah@leadboat.com> writes:
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.
It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
Certainly your AIX critters have shown this a bunch, but here's another
current example:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
That's reasonable. If you would like higher-fidelity data, I can run loops of
"pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
that for HEAD and 9.2 simultaneously. A day of logs from that should show
clearly if HEAD is systematically worse than 9.2.
That sounds like a fine plan, please do it.
So, I wish to raise the timeout for those animals. Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
change behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope. (With said unintended consequences in mind, I
made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
the service created.)
While this isn't necessarily a bad idea in isolation, the current
buildfarm scripts explicitly specify a -t value to pg_ctl stop, which
I would not expect an environment variable to override. So we need
to fix the buildfarm script to allow the timeout to be configurable.
I'm not sure if there would be any value-add in having pg_ctl answer
to an environment variable once we've done that.
regards, tom lane
On 02/08/2016 10:55 PM, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.

It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

Certainly your AIX critters have shown this a bunch, but here's another
current example:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

That's reasonable. If you would like higher-fidelity data, I can run loops of
"pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
that for HEAD and 9.2 simultaneously. A day of logs from that should show
clearly if HEAD is systematically worse than 9.2.

That sounds like a fine plan, please do it.
So, I wish to raise the timeout for those animals. Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
change behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope. (With said unintended consequences in mind, I
made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
the service created.)

While this isn't necessarily a bad idea in isolation, the current
buildfarm scripts explicitly specify a -t value to pg_ctl stop, which
I would not expect an environment variable to override. So we need
to fix the buildfarm script to allow the timeout to be configurable.
I'm not sure if there would be any value-add in having pg_ctl answer
to an environment variable once we've done that.
The failure on axolotl was for the ECPG tests, which don't use the
buildfarm's startup/stop db code. They don't honour TEMP_CONFIG either,
which they probably should - then we might get better log traces.
cheers
andrew
I wrote:
Noah Misch <noah@leadboat.com> writes:
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.
It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
Certainly your AIX critters have shown this a bunch, but here's another
current example:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive. See for example
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2
which is
Snapshot: 2016-02-09 16:27:05
waiting for server to shut down at 2016-02-09 18:04:09.159 UTC................................................................................ done
server stopped
=========== db log file ==========
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG: received fast shutdown request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG: aborting any active transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG: autovacuum launcher shutting down at 2016-02-09 18:04:09.160 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG: shutting down at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG: CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG: CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG: CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG: CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG: CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG: CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG: CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG: CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG: CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG: BufferSync(5) beginning to write 632 buffers at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG: BufferSync(5) done, wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG: CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG: CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG: CheckPointGuts done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG: checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG: pg_control updated at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG: in mdpostckpt, 3072 unlinks remain to do at 2016-02-09 18:04:29.375 UTC
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG: in mdpostckpt, 2048 unlinks remain to do at 2016-02-09 18:04:48.207 UTC
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG: in mdpostckpt, 1024 unlinks remain to do at 2016-02-09 18:05:07.381 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG: smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG: RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG: TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG: shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG: ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG: ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG: ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG: database system is shut down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG: lock files all released at 2016-02-09 18:05:26.165 UTC
This says that the bulk of the shutdown time is being spent in mdpostckpt,
where we are finally unlinking files belonging to relations that were
dropped since the last checkpoint. That took 75 seconds here, while the
parts before and after it took less than a second each. There were
somewhere between 3000 and 4000 files to be unlinked, so it's not
astonishing that it took some time, but our other critters seem to get
through this a lot quicker. (All four AIX critters reported pretty
similar results, BTW.)
So my diagnosis at this point is
(1) Slow file system, specifically slow unlink, is the core of the
problem. (I wonder if the AIX critters are using an NFS filesystem?)
(2) The apparent increase in failure rate in more recent branches is
probably real, but the likely cause is simply more test cases = more
stuff to be cleaned up at the end = mdpostckpt takes longer.
I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails. There might be another
slowdown mechanism somewhere, but I rather doubt it. Thoughts?
In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.
regards, tom lane
I wrote:
In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.
I looked into doing this as I suggested yesterday, namely modifying the
buildfarm scripts, and soon decided that it would be a mess; there are
too many cases where "pg_ctl stop" is not invoked directly by the script.
I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
*removing* the two existing hacks in run_build.pl that try to force -t 120.
The only real argument I can see against that approach is that we'd have
to back-patch the PGCTLTIMEOUT patch to all active branches if we want
to stop the buildfarm failures. We don't usually back-patch feature
additions. On the other hand, this wouldn't be the first time we've
back-patched something on grounds of helping the buildfarm, so I find
that argument pretty weak.
regards, tom lane
On 02/09/2016 03:05 PM, Tom Lane wrote:
I wrote:
In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.

I looked into doing this as I suggested yesterday, namely modifying the
buildfarm scripts, and soon decided that it would be a mess; there are
too many cases where "pg_ctl stop" is not invoked directly by the script.

I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
*removing* the two existing hacks in run_build.pl that try to force -t 120.

The only real argument I can see against that approach is that we'd have
to back-patch the PGCTLTIMEOUT patch to all active branches if we want
to stop the buildfarm failures. We don't usually back-patch feature
additions. On the other hand, this wouldn't be the first time we've
back-patched something on grounds of helping the buildfarm, so I find
that argument pretty weak.
OK. I can put out a new release as required.
cheers
andrew
I wrote:
I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails. There might be another
slowdown mechanism somewhere, but I rather doubt it. Thoughts?
Hmmm ... I take that back. AFAICT, the failures on Noah's AIX zoo are
sufficiently explained by the "mdpostckpt takes a long time after the
regression tests" theory. However, there is something else happening
on axolotl. Looking at the HEAD and 9.5 branches, there are three very
similar failures in the ECPG step within the past 60 days:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39
In all three, we got "pg_ctl: server does not shut down", but the
postmaster log claims that it shut down, and pretty speedily too.
For example, in the 2015-12-12 failure,
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: checkpoint starting: shutdown immediate
LOG: checkpoint complete: wrote 176 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.000 s, total=0.059 s; sync files=0, longest=0.000 s, average=0.000 s; distance=978 kB, estimate=978 kB
LOG: database system is shut down
We have no theory that would account for postmaster shutdown stalling
after the end of ShutdownXLOG, but that seems to be what happened.
How come? Why does only the ECPG test seem to be affected?
It's also pretty fishy that we have three failures in 60 days on HEAD+9.5
but none before that, and none in the older branches. That smells like
a recently-introduced bug, though I have no idea what.
Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.
regards, tom lane
I wrote:
... However, there is something else happening
on axolotl. Looking at the HEAD and 9.5 branches, there are three very
similar failures in the ECPG step within the past 60 days:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39
Oh look, searching with the "failures" BF webpage turns up a fourth such
failure on sungazer:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-18%2014%3A45%3A15
The plot thickens. I still have no clue what's going wrong though.
regards, tom lane
On 02/09/2016 05:53 PM, Tom Lane wrote:
Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.
Will do.
cheers
andrew
On 02/09/2016 06:46 PM, Andrew Dunstan wrote:
On 02/09/2016 05:53 PM, Tom Lane wrote:
Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.

Will do.
Incidentally, as I noted earlier, the ecpg tests don't honour
TEMP_CONFIG, and in axolotl's case this could well make a difference, as
it is set up like this:
extra_config => {
    DEFAULT => [
        q(log_line_prefix = '%m [%c:%l] '),
        "log_connections = 'true'",
        "log_disconnections = 'true'",
        "log_statement = 'all'",
        "fsync = off",
        "stats_temp_directory='/home/andrew/bf/root/stats_temp/$branch'",
    ],
},
So it's not running with fsync off or using the ramdisk for
stats_temp_directory. Of course, that doesn't explain why we're not
seeing it on branches earlier than 9.5, but it could explain why we're
only seeing it on the ecpg tests.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Incidentally, as I noted earlier, the ecpg tests don't honour
TEMP_CONFIG, and in axolotl's case this could well make a difference, as
it it set up like this:
...
So it's not running with fsync off or using the ramdisk for
stats_temp_directory.
Oooohh ...
I had just been looking into what else the postmaster does after "database
system is shut down" comes out. One of those activities is telling the
stats collector to shut down, and then waiting for it to do so. So a
really slow write of the collected stats might possibly account for
delaying things here.
What I was doing was adding some more logging in the
post-shutdown-checkpoint area, and this is what I see on dromedary,
when shutting down just after a regression test run:
LOG: database system is shut down at 2016-02-09 19:12:29.497 EST
LOG: doing before_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: calling exit(0) at 2016-02-09 19:12:29.498 EST
LOG: checkpointer dead at 2016-02-09 19:12:32.382 EST
LOG: all children dead at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 3 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: lock files all released at 2016-02-09 19:12:32.397 EST
LOG: calling exit(0) at 2016-02-09 19:12:32.397 EST
The first batch of those "on_shmem/proc_exit" reports are from the
checkpointer process, and the second set are from the postmaster.
The stats collector shutdown would be between "checkpointer dead"
and "all children dead". We can see that on this machine, that's
not really an issue ... but how in the world did it take the postmaster
nigh three seconds to notice the checkpoint process exit? Something
very odd indeed there.
Anyway, I think I should push this additional instrumentation so you
can use it on axolotl.
This also makes it look like we really need to revisit where/when the
"database system is shut down" message is printed. But that's for
later.
regards, tom lane
I wrote:
Anyway, I think I should push this additional instrumentation so you
can use it on axolotl.
Done.
regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes:
So running it's not running with fsync off or using the ramdisk for
stats_temp_directory. Of course, that doesn't explain why we're not
seeing it on branches earlier than 9.5, but it could explain why we're
only seeing it on the ecpg tests.
BTW, the evidence we already have from axolotl's last run is that
post-checkpoint shutdown in the ecpg test was pretty quick:
LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
LOG: lock files all released at 2016-02-09 16:31:14.817 EST
However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously. I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.
regards, tom lane
On 02/09/2016 07:49 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
So it's not running with fsync off or using the ramdisk for
stats_temp_directory. Of course, that doesn't explain why we're not
seeing it on branches earlier than 9.5, but it could explain why we're
only seeing it on the ecpg tests.
BTW, the evidence we already have from axolotl's last run is that
post-checkpoint shutdown in the ecpg test was pretty quick:
LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
LOG: lock files all released at 2016-02-09 16:31:14.817 EST
However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously. I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.
Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
buildfarm animal. About the only other thing of note is a very lightly
configured Nagios instance.
Of course, I could put it under continuous load running a compilation or
something.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/09/2016 07:49 PM, Tom Lane wrote:
However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously. I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.
Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
buildfarm animal. About the only other thing of note is a very lightly
configured Nagios instance.
Huh, that's quite strange. There is one fairly recent report of
axolotl failing "make check" because of taking over a minute to shut down:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52
but the runs before and after that show shutdown times of only a second or
two.
regards, tom lane
On 02/09/2016 08:49 PM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/09/2016 07:49 PM, Tom Lane wrote:
However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously. I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.
Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
buildfarm animal. About the only other thing of note is a very lightly
configured Nagios instance.
Huh, that's quite strange. There is one fairly recent report of
axolotl failing "make check" because of taking over a minute to shut down:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52
but the runs before and after that show shutdown times of only a second or
two.
No idea why.
Anyway, we got a failure pretty quickly:
pg_ctl: server does not shut down at 2016-02-09 21:10:11.914 EST
pg_regress: could not stop postmaster: exit code was 256
Makefile:81: recipe for target 'check' failed
make: *** [check] Error 2
The log traces from the shutdown are below.
cheers
andrew
LOG: received fast shutdown request at 2016-02-09 21:09:11.824 EST
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down at 2016-02-09 21:09:11.830 EST
LOG: shutting down at 2016-02-09 21:09:11.839 EST
LOG: checkpoint starting: shutdown immediate
LOG: CheckPointGuts starting at 2016-02-09 21:09:11.840 EST
LOG: CheckPointCLOG() done at 2016-02-09 21:09:11.840 EST
LOG: CheckPointCommitTs() done at 2016-02-09 21:09:11.840 EST
LOG: CheckPointSUBTRANS() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointMultiXact() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointPredicate() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointRelationMap() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointReplicationSlots() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointSnapBuild() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 21:09:11.842 EST
LOG: BufferSync(5) beginning to write 172 buffers at 2016-02-09 21:09:11.845 EST
LOG: BufferSync(5) done, wrote 172/172 buffers at 2016-02-09 21:09:14.635 EST
LOG: CheckPointBuffers() done at 2016-02-09 21:09:14.636 EST
LOG: CheckPointReplicationOrigin() done at 2016-02-09 21:09:14.636 EST
LOG: CheckPointGuts done at 2016-02-09 21:09:14.636 EST
LOG: checkpoint WAL record flushed at 2016-02-09 21:09:14.636 EST
LOG: pg_control updated at 2016-02-09 21:09:14.637 EST
LOG: smgrpostckpt() done at 2016-02-09 21:09:14.668 EST
LOG: RemoveOldXlogFiles() done at 2016-02-09 21:09:14.668 EST
LOG: TruncateSUBTRANS() done at 2016-02-09 21:09:14.669 EST
LOG: checkpoint complete: wrote 172 buffers (1.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=2.794 s, sync=0.000 s,
total=2.829 s; sync files=0, longest=0.000 s, average=0.000 s;
distance=966 kB, estimate=966 kB
LOG: shutdown checkpoint complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownCLOG() complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownCommitTs() complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownSUBTRANS() complete at 2016-02-09 21:09:14.669 EST
LOG: database system is shut down at 2016-02-09 21:09:14.669 EST
LOG: doing before_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG: doing on_proc_exit 1 at 2016-02-09 21:09:14.670 EST
LOG: doing on_proc_exit 0 at 2016-02-09 21:09:14.671 EST
LOG: calling exit(0) at 2016-02-09 21:09:14.671 EST
LOG: checkpointer dead at 2016-02-09 21:09:14.683 EST
LOG: all children dead at 2016-02-09 21:10:11.184 EST
LOG: doing on_shmem_exit 3 at 2016-02-09 21:10:11.191 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 21:10:11.191 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 21:10:11.192 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 21:10:11.208 EST
LOG: doing on_proc_exit 1 at 2016-02-09 21:10:11.209 EST
LOG: doing on_proc_exit 0 at 2016-02-09 21:10:11.209 EST
LOG: lock files all released at 2016-02-09 21:10:11.211 EST
LOG: calling exit(0) at 2016-02-09 21:10:11.211 EST
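[Editor's note: the telling gap in the trace above is between "checkpointer
dead" (21:09:14.683) and "all children dead" (21:10:11.184): the shutdown
checkpoint itself finished in under three seconds, but the postmaster then
waited the better part of a minute for its remaining children to exit. A
quick sanity check of that delta, as a minimal sketch doing nothing but
timestamp arithmetic on the two log lines quoted above:]

```python
from datetime import datetime

# Timestamps copied verbatim from the log trace above.
checkpointer_dead = datetime.strptime("2016-02-09 21:09:14.683",
                                      "%Y-%m-%d %H:%M:%S.%f")
all_children_dead = datetime.strptime("2016-02-09 21:10:11.184",
                                      "%Y-%m-%d %H:%M:%S.%f")

# Time the postmaster spent waiting for children other than the checkpointer.
gap = (all_children_dead - checkpointer_dead).total_seconds()
print(f"waited {gap:.3f} s between 'checkpointer dead' and 'all children dead'")
```

That works out to 56.501 seconds, which accounts for essentially all of the
pg_ctl timeout that was consumed after the checkpoint completed.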