pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception. It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill. The animals share a machine with
each other and with dozens of other developers. I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods. Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.
The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout. Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice. If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.
The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it. The timing dependency comes from the pg_ctl "slop" time:
/*
* Make sanity checks. If it's for a standalone backend
* (negative PID), or the recorded start time is before
* pg_ctl started, then either we are looking at the wrong
* data directory, or this is a pre-existing pidfile that
* hasn't (yet?) been overwritten by our child postmaster.
* Allow 2 seconds slop for possible cross-process clock
* skew.
*/
The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:
All of this could be improved if we rewrote start_postmaster() so that it
could report the child postmaster's PID, so that we'd know a-priori the
correct PID to test with postmaster_is_alive(). That looks like a bit too
much change for so late in the 9.1 development cycle, unfortunately.
I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;
command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
- 'second pg_ctl start succeeds');
+sleep 3; # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+ 'second pg_ctl start fails');
command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
'pg_ctl stop -w');
Alternately, I could just remove the test.
crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
--- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -35,6 +35,7 @@ close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], - 'second pg_ctl start succeeds'); +sleep 3; # bridge test_postmaster_connection() slop threshold +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], + 'second pg_ctl start fails'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w');
I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.
The second pg_ctl start doesn't seem to test something very
interesting. I'm inclined to just remove it. I'm not caffeinated
sufficiently, but afaics that ought to sidestep the issue as stop
doesn't depend on the slop time?
crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
Sounds like an actual production hazard too.
- Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
--- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -35,6 +35,7 @@ close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], - 'second pg_ctl start succeeds'); +sleep 3; # bridge test_postmaster_connection() slop threshold +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], + 'second pg_ctl start fails'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w');
I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.
Me either. This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup. I think we ought
to just fix it. I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
Andres Freund <andres@anarazel.de> writes:
I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.
Me either. This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup. I think we ought
to just fix it. I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.
Attached is a draft patch for this. I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows. The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster. If so, I do not know how to fix that, or whether
it's fixable at all.
Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.
If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.
regards, tom lane
Attachments:
fork-exec-in-pg_ctl.patchtext/x-diff; charset=us-ascii; name=fork-exec-in-pg_ctl.patchDownload+140-126
On 09/03/2015 03:31 PM, Tom Lane wrote:
I wrote:
Andres Freund <andres@anarazel.de> writes:
I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.Me either. This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup. I think we ought
to just fix it. I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.Attached is a draft patch for this. I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows. The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster. If so, I do not know how to fix that, or whether
it's fixable at all.Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.
There is no equivalent of execl, nor a cmd.exe exquivalent of the
shell's exec. But surely the equivalent of the fork/execl you're doing
here would be a simple CreateProcess(). I don't see why you need a shell
in the middle on Windows at all.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
There is no equivalent of execl, nor a cmd.exe exquivalent of the
shell's exec. But surely the equivalent of the fork/execl you're doing
here would be a simple CreateProcess(). I don't see why you need a shell
in the middle on Windows at all.
The problem is to get the output redirection to work. I imagine it could
be rewritten without involving the shell, but it'd be less than trivial
(or at least, I'm not going to do it).
If we did get that to work, it could probably be applied to the
service-start case as well, which currently just does a CreateProcess and
pays no attention to what happens to postmaster stderr.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
There is no equivalent of execl, nor a cmd.exe exquivalent of the
shell's exec. But surely the equivalent of the fork/execl you're doing
here would be a simple CreateProcess(). I don't see why you need a shell
in the middle on Windows at all.The problem is to get the output redirection to work. I imagine it could
be rewritten without involving the shell, but it'd be less than trivial
(or at least, I'm not going to do it).
In the CreateProcess model, I think you can use startupInfo.hStdOutput,
see
https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
I wrote:
This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.
I scratched my head awhile without thinking of a credible production use case
that would notice the difference. Which one did you have in mind?
Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.
That's the better behavior, agreed.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Noah Misch <noah@leadboat.com> writes:
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.
I scratched my head awhile without thinking of a credible production use case
that would notice the difference. Which one did you have in mind?
Well, mainly that it's making our regression tests fail, which suggests
behavioral instability that could be important for production.
Aside from the specific case you diagnosed, it's clear from buildfarm
results that the five-second timeout elsewhere in
test_postmaster_connection has got funny behavior under load; there are
way too many cases where pg_ctl gives up after exactly that long, with
no useful information printed. The draft patch I posted gets rid of that
behavior ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.I scratched my head awhile without thinking of a credible production use case
that would notice the difference. Which one did you have in mind?Well, mainly that it's making our regression tests fail, which suggests
behavioral instability that could be important for production.
I doubt issuing two "pg_ctl start" within two seconds, with no intervening
stop, has a production use. However, ...
Aside from the specific case you diagnosed, it's clear from buildfarm
results that the five-second timeout elsewhere in
test_postmaster_connection has got funny behavior under load; there are
way too many cases where pg_ctl gives up after exactly that long, with
no useful information printed. The draft patch I posted gets rid of that
behavior ...
... fixing this is a good deal more valuable. Besides coping with stalls in
early postmaster startup, pg_ctl would no longer take 5s to exit when the
postmaster reports a postgresql.conf defect or other early error.
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.
That seems prudent. We'll keep meeting buildfarm members with performance
variability sufficient to reach that 5s timeout. (Incidentally, the current
coding first appeared in 9.1.)
--- 650,671 ---- }/*
! * Reap child process if it died; if it remains in zombie state then
! * our kill-based test will think it's still running.
*/
! #ifndef WIN32
{
! int exitstatus;! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);
Let's report any exit status. When a signal kills the young postmaster, it's
a waste to omit that detail.
}
+ #endif/*
! * Check whether the child postmaster process is still alive. This
! * lets us exit early if the postmaster fails during startup.
*/
! if (!postmaster_is_alive((pid_t) pm_pid))
return PQPING_NO_RESPONSE;
kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough.
Thanks,
nm
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Noah Misch <noah@leadboat.com> writes:
kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough.
The kill() coding works on Windows (I believe); waitpid() not so much.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough.
The kill() coding works on Windows (I believe); waitpid() not so much.
Windows has the concept under a different name. See postmaster.c.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Noah Misch <noah@leadboat.com> writes:
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
The kill() coding works on Windows (I believe); waitpid() not so much.
Windows has the concept under a different name. See postmaster.c.
Well, if someone else wants to code and test that, feel free. I don't
do Windows ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
Attached is a draft patch for this. I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows. The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster. If so, I do not know how to fix that, or whether
it's fixable at all.
Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.
If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.
Nobody seems to have stepped up to fix the Windows side of this, but
after some more thought it occurred to me that maybe it doesn't need
(much) fixing. There are two things that pg_ctl wants the PID for:
1. To check if the postmaster.pid file contains the expected child process
PID. Well, that's a nice-to-have, but we never had it before and got
along well enough, because the start-timestamp comparison check covers
most real-world cases. We can omit the PID-match check on Windows.
2. To see whether the child postmaster process is still running, via a
"kill(pid, 0)" test. But if we've launched the postmaster through a
shell, and not used "&" semantics, presumably that shell will wait for its
child process to exit and then exit itself. So *testing whether the shell
is still there is just about as good as testing the real postmaster PID*.
Therefore, we don't really need to worry about rewriting the Windows
subprocess-launch logic. If someone would like to do it, that would
be nice, but we don't have to wait for that to happen before we can
get rid of the 5-second-timeout issues.
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)
regards, tom lane
Attachments:
fork-exec-in-pg_ctl-2.patchtext/x-diff; charset=us-ascii; name=fork-exec-in-pg_ctl-2.patchDownload+170-148
I wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)
Has anyone tried that patch on Windows? I'd like to push it in hopes of
improving buildfarm stability, but I'm hesitant to do so without some
confirmation that it acts as I think it will.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)Has anyone tried that patch on Windows? I'd like to push it in hopes of
improving buildfarm stability, but I'm hesitant to do so without some
confirmation that it acts as I think it will.
I marked this patch as something to look at, though I haven't take the
time to do it yet...
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Oct 7, 2015 at 7:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)Has anyone tried that patch on Windows? I'd like to push it in hopes of
improving buildfarm stability, but I'm hesitant to do so without some
confirmation that it acts as I think it will.I marked this patch as something to look at, though I haven't take the
time to do it yet...
s/take/taken
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)
If "pg_ctl start" is invoked directly from a command prompt, it works.
Now, if I run "pg_ctl start" within a script (in an msysgit
environment for example), pg_ctl fails, complaining that
postmaster.pid already exists. The TAP tests get broken as well,
attached are the log files of the run, and here is an interested bit:
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start... stopped waiting
not ok 12 - pg_ctl start -w
# Failed test 'pg_ctl start -w'
# at C:/Users/ioltas/git/postgres/src/test/perl/TestLib.pm line 282.
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start...FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 4040) running in data directory
"C:/Users/ioltas/git/postgres/src/bin/pg_ctl/tmp_testor2K/data"?
LOG: database system was shut down at 2015-10-06 23:58:02 PDT
LOG: MultiXact member wraparound protections are now enabled
FATAL: the database system is starting up
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
stopped waiting
Registering a service to the SCM works, but the service registered
refuses to start.
Regards,
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)
If "pg_ctl start" is invoked directly from a command prompt, it works.
Now, if I run "pg_ctl start" within a script (in an msysgit
environment for example), pg_ctl fails, complaining that
postmaster.pid already exists. The TAP tests get broken as well,
Hm, the TAP trace makes it look like the postmaster start does actually
work, but pg_ctl isn't correctly waiting for that to happen. (The
complaint about "already exists" seems to be from the second start
attempt, and is exactly what to expect there.) I could believe that
I'd fat-fingered the PID comparison logic in test_postmaster_connection,
but I don't understand how it would work from a command prompt but not
from a script. Any ideas?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before. I believe that this
might actually work on Windows, but I have no way to test it. Someone
please try that? (Don't forget to test the service-start path, too.)
If "pg_ctl start" is invoked directly from a command prompt, it works.
Now, if I run "pg_ctl start" within a script (in an msysgit
environment for example), pg_ctl fails, complaining that
postmaster.pid already exists. The TAP tests get broken as well,
Reading that again, I think you mean that "pg_ctl start" works but you
didn't try "pg_ctl start -w" manually? Because it's "-w" that's at
issue here, and the failing test cases are using that.
After studying the test logs more carefully, I think the behavior they're
showing is consistent with the idea that postmaster_is_alive() doesn't
work on the CMD shell process. Which seems very likely now that I think
about it, because we're depending on an implementation of kill() that
is PG-specific.
So in fact this idea doesn't work :-(.
I think there is still room to salvage something without fully rewriting
the postmaster invocation logic to avoid using CMD, because it's still
true that checking whether the CMD process is still there should be as
good as checking the postmaster proper. We just can't use kill() for it.
I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
mentioned upthread, if we're using fork/exec we might as well pay
attention to waitpid's results instead. On Windows, I imagine that the
thing to do is have start_postmaster() save aside the handle for the CMD
process, and then in test_postmaster_connection(), check the handle state
to see if the process is still running (I assume there's a way to do
that).
I can take care of the Unix side of that, but as before, somebody else
would need to code and test the Windows side. It's probably just a few
lines of code to add ... any volunteers?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers