Intermittent pg_ctl failures on Windows

Started by Tom Laneabout 8 years ago16 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

The buildfarm's Windows members occasionally show weird pg_ctl failures,
for instance this recent case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2018-03-10%2020%3A30%3A20

### Restarting node "master"
# Running: pg_ctl -D G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata -l G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log restart
waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed

or this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2017-12-29%2023%3A30%3A24

### Stopping node "subscriber" using mode fast
# Running: pg_ctl -D c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata -m fast stop
waiting for server to shut down....pg_ctl: could not open PID file "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid": Permission denied
Bail out! system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus
interference, but it suddenly occurred to me that there might be a
consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
for server start/stop, it is trying to read postmaster.pid more-or-less
concurrently with the postmaster writing to that file. On Unix that's not
much of a problem, but I believe that on Windows you have to specifically
open the file with sharing enabled, or you get error messages like these.
The postmaster should be enabling sharing, because port.h redirects
open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing flags.
But it only does that #ifndef FRONTEND. So pg_ctl is just using naked
open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the
problem if you modify the postmaster to hold postmaster.pid open longer
when rewriting it, e.g. stick fractional-second sleeps into CreateLockFile
and AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix,
but I think somebody should.

regards, tom lane

#2Badrul Chowdhury
bachow@microsoft.com
In reply to: Tom Lane (#1)
RE: Intermittent pg_ctl failures on Windows

Hi Tom,

This is a great catch. I am looking into it: I will start by reproducing the error as you suggested.

Thanks,
Badrul

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, March 10, 2018 2:48 PM
To: pgsql-hackers@lists.postgresql.org
Subject: Intermittent pg_ctl failures on Windows

The buildfarm's Windows members occasionally show weird pg_ctl failures, for instance this recent case:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0

### Restarting node "master"
# Running: pg_ctl -D G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata -l G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log restart waiting for server to shut down.... done server stopped waiting for server to start....The process cannot access the file because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed

or this one:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0

### Stopping node "subscriber" using mode fast # Running: pg_ctl -D c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata -m fast stop waiting for server to shut down....pg_ctl: could not open PID file "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid": Permission denied Bail out! system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus interference, but it suddenly occurred to me that there might be a consistent explanation: since commit f13ea95f9, when pg_ctl is waiting for server start/stop, it is trying to read postmaster.pid more-or-less concurrently with the postmaster writing to that file. On Unix that's not much of a problem, but I believe that on Windows you have to specifically open the file with sharing enabled, or you get error messages like these.
The postmaster should be enabling sharing, because port.h redirects open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing flags.
But it only does that #ifndef FRONTEND. So pg_ctl is just using naked open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the problem if you modify the postmaster to hold postmaster.pid open longer when rewriting it, e.g. stick fractional-second sleeps into CreateLockFile and AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix, but I think somebody should.

regards, tom lane

#3Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Badrul Chowdhury (#2)
Re: Intermittent pg_ctl failures on Windows

Hello!

We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000
times.

There are two different problems with pg_ctl:
1 - share access to postmaster.pid;

2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new
cmd.exe.

waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file
because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.

To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function
to correctly handle share locks.
On Windows systems we cannot handle ERROR_DELETE_PENDING because
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

To avoid the second error we added the wait for cmd.exe finish in the
do_stop() and do_restart() functions.

Git patch is in the attachment.

On 2018-03-12 18:55, Badrul Chowdhury wrote:

Hi Tom,

This is a great catch. I am looking into it: I will start by
reproducing the error as you suggested.

Thanks,
Badrul

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, March 10, 2018 2:48 PM
To: pgsql-hackers@lists.postgresql.org
Subject: Intermittent pg_ctl failures on Windows

The buildfarm's Windows members occasionally show weird pg_ctl
failures, for instance this recent case:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0

### Restarting node "master"
# Running: pg_ctl -D
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
-l
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
restart waiting for server to shut down.... done server stopped
waiting for server to start....The process cannot access the file
because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed

or this one:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0

### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
-m fast stop waiting for server to shut down....pg_ctl: could not open
PID file
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
Permission denied Bail out! system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus
interference, but it suddenly occurred to me that there might be a
consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
for server start/stop, it is trying to read postmaster.pid
more-or-less concurrently with the postmaster writing to that file.
On Unix that's not much of a problem, but I believe that on Windows
you have to specifically open the file with sharing enabled, or you
get error messages like these.
The postmaster should be enabling sharing, because port.h redirects
open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
flags.
But it only does that #ifndef FRONTEND. So pg_ctl is just using naked
open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the
problem if you modify the postmaster to hold postmaster.pid open
longer when rewriting it, e.g. stick fractional-second sleeps into
CreateLockFile and AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix,
but I think somebody should.

regards, tom lane

--

regars, Roman Zharkov

Attachments:

pg_ctl.patchtext/x-diff; name=pg_ctl.patchDownload+150-4
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zharkov Roman (#3)
Re: Intermittent pg_ctl failures on Windows

r.zharkov@postgrespro.ru writes:

pg_ctl now opens the postmaster.pid file using pgwin32_open() function
to correctly handle share locks.

HEAD already does that, no? See f02259fe9.

On Windows systems we cannot handle ERROR_DELETE_PENDING because
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

This seems improbably broken/stupid. Also, I've not seen any buildfarm
failures that would match this; it's always pg_ctl complaining not the
postmaster.

To avoid the second error we added the wait for cmd.exe finish in the
do_stop() and do_restart() functions.

Hmmm ... there seems the germ of a good idea here, but can't we do
it with less (and less ugly) code?

Alternatively, perhaps we could fix things so that the parent cmd.exe
shell isn't involved in logfile access? That'd require teaching the
postmaster to open/redirect its stdout/stderr, which is kind of
annoying, but it might beat hacking things as you have done here.

regards, tom lane

#5Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Tom Lane (#4)
Re: Intermittent pg_ctl failures on Windows

On 2019-07-17 20:51, Tom Lane wrote:

r.zharkov@postgrespro.ru writes:

pg_ctl now opens the postmaster.pid file using pgwin32_open() function
to correctly handle share locks.

HEAD already does that, no? See f02259fe9.

You are right. I tested branch REL_11_STABLE and it is my mistake.

On Windows systems we cannot handle ERROR_DELETE_PENDING because
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

This seems improbably broken/stupid. Also, I've not seen any buildfarm
failures that would match this; it's always pg_ctl complaining not the
postmaster.

The probability is very small. In one of our tests pg_ctl fails with
that error sometime.
In a test with multiple frequent restarts the probability is 5-6%. On
other machines probability differs.
To find out the real error code we used the Process Monitor utility.

To avoid the second error we added the wait for cmd.exe finish in the
do_stop() and do_restart() functions.

Hmmm ... there seems the germ of a good idea here, but can't we do
it with less (and less ugly) code?

Alternatively, perhaps we could fix things so that the parent cmd.exe
shell isn't involved in logfile access? That'd require teaching the
postmaster to open/redirect its stdout/stderr, which is kind of
annoying, but it might beat hacking things as you have done here.

If we need parent cmd.exe only for log writing, can we start the
postmaster without it?
Is cmd.exe necessary?

--
regards, Roman Zharkov

#6Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#4)
Re: Intermittent pg_ctl failures on Windows

On Wed, Jul 17, 2019 at 09:51:48AM -0400, Tom Lane wrote:

r.zharkov@postgrespro.ru writes:

On Windows systems we cannot handle ERROR_DELETE_PENDING because
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

This seems improbably broken/stupid. Also, I've not seen any buildfarm
failures that would match this; it's always pg_ctl complaining not the
postmaster.

Oh, it is. A lot. And this has been discussed a couple of times
across multiple threads on -bugs and/or -hackers. I think that
Windows just lacks a way to detect reliably if a file is pending for
deletion or not, and there is no way that we are going to enforce
blindly a ERROR_DELETE_PENDING when we see EACCES. One code path
which is impacted by that is our wrapper for stat() for the win32
port... For now I'd rather believe that what the OS tells us is true,
in which case here it is plain wrong, but well.
--
Michael

#7Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Michael Paquier (#6)
Re: Intermittent pg_ctl failures on Windows

Hello,

Therefore, we suggest replace the deletion of a lock file by renaming it.
unlink in windows is not an atomic operation.
When we try to open the file between
SetDispositionInformationFile and CloseFile we get ERROR_DELETE_PENDING ( see screenshot )

Attachments:

image1.pngimage/png; name=image1.png; x-apple-part-url=7599006A-1D1E-409C-9C1E-2E25CC304A72Download
#8Michael Paquier
michael@paquier.xyz
In reply to: Zharkov Roman (#5)
Re: Intermittent pg_ctl failures on Windows

On Wed, Jul 17, 2019 at 09:54:16PM +0700, r.zharkov@postgrespro.ru wrote:

You are right. I tested branch REL_11_STABLE and it is my mistake.

[...]

The probability is very small. In one of our tests pg_ctl fails with that
error sometime.
In a test with multiple frequent restarts the probability is 5-6%. On other
machines probability differs.
To find out the real error code we used the Process Monitor utility.

I would like to be sure of something here. Are those failures based
on what you have tested on REL_11_STABLE or do you still see pg_ctl
complain about such things on HEAD (post f02259fe as mentioned by Tom
upthread)? I would like to think that you mean the former and the
latter, but this thread mentions only the former.
--
Michael

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Zharkov Roman (#7)
Re: Intermittent pg_ctl failures on Windows

At Thu, 18 Jul 2019 14:04:34 +0700, Жарков Роман <r.zharkov@postgrespro.ru> wrote in <ECD07611-BBA8-4E9B-975C-50E59F7154DA@postgrespro.ru>

Hello,

Therefore, we suggest replace the deletion of a lock file by renaming it.
unlink in windows is not an atomic operation.
When we try to open the file between
SetDispositionInformationFile and CloseFile we get ERROR_DELETE_PENDING ( see screenshot )

I found a mail in the arvhive.

/messages/by-id/CABUevExwz9ncAk90Hb3Sb4MHeaOEyS59eTb_AKW9qFEjAidg5Q@mail.gmail.com

One approach would be to rename the file into something indicating it's
being deleted, before actually deleting it.

That would be an option (IIRC if you open with FILE_SHARE_DELETE, it can
also be renamed). But if we can track the delete when we try to open it

and

just treat it as "file does not exist", that seems cleaner.

I'm not sure what exactly you're suggesting. But isn't there the issue
that such an approach will not interoperate with external tools?

I'll have to admit I wasn't thinking of external tools. The external tools
would have to learn about pending-delete files themselves. So yeah, if we
care about those then renaming them to something predictable and then
instruct the third party tools to exclude such files would be a more
complete fix.

STATUS_PENDING_DELETE is concealed under win32 API (seems like a kind of
misdesign if it is true that ERROR_PENDING_DELETE is defined but not actually
used as I saw in someone's blog.). AFAICS native interfaces like NtOpenFile
returns the native status code. That is a bit cumbersome but not too difficult
(as heard).

https://resources.infosecinstitute.com/calling-ntdll-functions-directly/

That said, I haven't succeeded even to open a file with it. (I replaced the path
with NT native path (\??\c:\blah..) since NtOpenFile doesn't accept the
user-friendly(?) paths).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Michael Paquier (#8)
Re: Intermittent pg_ctl failures on Windows

I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

regards, Roman

Show quoted text

18 июля 2019 г., в 15:51, Michael Paquier <michael@paquier.xyz> написал(а):

On Wed, Jul 17, 2019 at 09:54:16PM +0700, r.zharkov@postgrespro.ru wrote:
You are right. I tested branch REL_11_STABLE and it is my mistake.

[...]

The probability is very small. In one of our tests pg_ctl fails with that
error sometime.
In a test with multiple frequent restarts the probability is 5-6%. On other
machines probability differs.
To find out the real error code we used the Process Monitor utility.

I would like to be sure of something here. Are those failures based
on what you have tested on REL_11_STABLE or do you still see pg_ctl
complain about such things on HEAD (post f02259fe as mentioned by Tom
upthread)? I would like to think that you mean the former and the
latter, but this thread mentions only the former.
--
Michael

#11Michael Paquier
michael@paquier.xyz
In reply to: Zharkov Roman (#10)
Re: Intermittent pg_ctl failures on Windows

On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:

I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

Yeah, that was a failure from me, so I tend to be rather very careful
about anything related to Windows. However, after that we have added
40cfe86 about which nobody has complained yet, and the number of
buildfarm failures about pg_ctl concurrency on HEAD has gone down to
zero since (perhaps I am missing something?).

So, instead of trying to invent a new solution only for stable
branches (which may have its own bugs we would need to deal with only
on stable branches, and only for Windows), why don't we just try to
move forward into back-patching those pieces? Or it happens that we
still have some potential failures on HEAD and REL_12_STABLE which
would justify some extra handling? In this case, I would recommend
that we focus on HEAD as a first step, and put things in order there.
--
Michael

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#11)
Re: Intermittent pg_ctl failures on Windows

Michael Paquier <michael@paquier.xyz> writes:

On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:

I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

Yeah, that was a failure from me, so I tend to be rather very careful
about anything related to Windows. However, after that we have added
40cfe86 about which nobody has complained yet, and the number of
buildfarm failures about pg_ctl concurrency on HEAD has gone down to
zero since (perhaps I am missing something?).

Hm, I think 0ba06e0 is actually the relevant change here? Though
40cfe86 was a necessary cleanup fix.

I'm too tired to dig in the buildfarm database to be sure, but my
impression is that the failure rate is much-better-but-not-zero.
So I'd support back-patching those two commits, but I'm not sure
if that's the end of the conversation.

regards, tom lane

#13Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Michael Paquier (#11)
Re: Intermittent pg_ctl failures on Windows

First time we found pg_ctl errors while testing our fork.
I reproduced them on REL_11_STABLE.
I found three problems with pg_ctl do_stop/do_restart:
1 - "old" fopen() function;
2 - "delete pending" problem rarely happens with "new" fopen() function when pg_ctl tries to open postmaster.pid file;
3 - cmd.exe shell may block the log file when the server restarts;

Now i try to reproduce it on REL_12_STABLE

regard, Roman

Show quoted text

19 июля 2019 г., в 10:02, Michael Paquier <michael@paquier.xyz> написал(а):

On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:
I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

Yeah, that was a failure from me, so I tend to be rather very careful
about anything related to Windows. However, after that we have added
40cfe86 about which nobody has complained yet, and the number of
buildfarm failures about pg_ctl concurrency on HEAD has gone down to
zero since (perhaps I am missing something?).

So, instead of trying to invent a new solution only for stable
branches (which may have its own bugs we would need to deal with only
on stable branches, and only for Windows), why don't we just try to
move forward into back-patching those pieces? Or it happens that we
still have some potential failures on HEAD and REL_12_STABLE which
would justify some extra handling? In this case, I would recommend
that we focus on HEAD as a first step, and put things in order there.
--
Michael

#14Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#12)
Re: Intermittent pg_ctl failures on Windows

On Fri, Jul 19, 2019 at 12:59:43AM -0400, Tom Lane wrote:

Hm, I think 0ba06e0 is actually the relevant change here? Though
40cfe86 was a necessary cleanup fix.

Oops. Yes, I meant that.

I'm too tired to dig in the buildfarm database to be sure, but my
impression is that the failure rate is much-better-but-not-zero.
So I'd support back-patching those two commits, but I'm not sure
if that's the end of the conversation.

Yeah, I am a bit afraid to miss something under the hood... Let's
see... Looking at the last 90 days of failures, I actually see one
from 63 days ago from bowerbird which seems to indicate that we are not yet
completely done with it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&amp;dt=2019-05-16%2010%3A30%3A52
# Running: pg_ctl stop -D
H:\prog\bf\root\HEAD\pgsql.build\src\bin\pg_ctl\tmp_check\tmp_test_qB5F/data
waiting for server to shut down....pg_ctl: could not open PID file
"H:/prog/bf/root/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/tmp_test_qB5F/data/postmaster.pid":
Permission denied
not ok 16 - pg_ctl stop

And there is this one from jacana with subscription tests from 27 days
ago:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&amp;dt=2019-06-12%2003%3A42%3A55

This makes it at least two, still that's much less than before :(
--
Michael

#15Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Tom Lane (#12)
Re: Intermittent pg_ctl failures on Windows

Hello,

I have reproduced the DELETE_PENDING error on the REL_12_STABLE.
I changed the WAITS_PER_SEC to 500000 in the pg_ctl.c file.
Then I launched dummy TAP test ( see attachment ).
The picture in the attachment illustrates the hidden DELETE_PENDING
error.

-------------------------------------------
Log output:
# Postmaster PID for node "master" is 17796
# restarted 233
# 14:31:05
### Restarting node "master"
# Running: pg_ctl -D
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata
-l C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/log/test_master.log
restart
waiting for server to shut down....pg_ctl: could not open PID file
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid":
Permission denied
Bail out! system pg_ctl failed
### Stopping node "master" using mode immediate
# Running: pg_ctl -D
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata
-m immediate stop
pg_ctl: PID file
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid"
does not exist
Is server running?
Bail out! system pg_ctl failed
-------------------------------------------

On 2019-07-19 11:59, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:

I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

Yeah, that was a failure from me, so I tend to be rather very careful
about anything related to Windows. However, after that we have added
40cfe86 about which nobody has complained yet, and the number of
buildfarm failures about pg_ctl concurrency on HEAD has gone down to
zero since (perhaps I am missing something?).

Hm, I think 0ba06e0 is actually the relevant change here? Though
40cfe86 was a necessary cleanup fix.

I'm too tired to dig in the buildfarm database to be sure, but my
impression is that the failure rate is much-better-but-not-zero.
So I'd support back-patching those two commits, but I'm not sure
if that's the end of the conversation.

regards, tom lane

--
regards, Roman

Attachments:

test.pltext/plain; name=test.plDownload
postmaster_pid.PNGimage/png; name=postmaster_pid.PNGDownload+6-2
regress_log_testtext/plain; name=regress_log_testDownload
#16Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Tom Lane (#12)
Re: Intermittent pg_ctl failures on Windows

Hello,

I have reproduced the CMD.EXE error "The process cannot access the file
because it is being used by another process" the REL_12_STABLE.
I simply launched my dummy TAP test and I got «Процесс не может получить
доступ к файлу, так как этот файл занят другим процессом.»
( “The process cannot access the file because it is being used by
another process” in english Windows version )

The pictures in the attachment illustrate the SHARING_VIOLATION error.

-------------------------------------------
# Postmaster PID for node "master" is 10456
# restarted 32
# 13:28:58
### Restarting node "master"
# Running: pg_ctl -D
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata
-l C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/log/test_master.log
restart
waiting for server to shut down.... done
server stopped
waiting for server to start....Процесс не может получить доступ к файлу,
так как этот файл занят другим процессом.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed
### Stopping node "master" using mode immediate
# Running: pg_ctl -D
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata
-m immediate stop
pg_ctl: PID file
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid"
does not exist
Is server running?
Bail out! system pg_ctl failed
------------------------------------------

--
regards, Roman

Attachments:

test.pltext/plain; name=test.plDownload
sharing_violation.PNGimage/png; name=sharing_violation.PNGDownload
process_tree.PNGimage/png; name=process_tree.PNGDownload+7-2
regress_log_testtext/plain; name=regress_log_testDownload