pgsql: Move any remaining files generated by pg_upgrade into an internal subdir

Started by Michael Paquier · over 3 years ago · 36 messages · pgsql-hackers
#1Michael Paquier
michael@paquier.xyz

Move any remaining files generated by pg_upgrade into an internal subdir

This change concerns a couple of .txt files (for internal state checks)
that were still written in the path where the binary is executed, and
not in the subdirectory located in the target cluster. Like the other
.txt files that already do so (like loadable_libraries.txt), these are
saved in the base output directory. Note that on failure, the logs
report the full path to the .txt file generated, so these are easy to
find.

Oversight in 38bfae3.

Author: Daniel Gustafsson
Reviewed-by: Michael Paquier, Justin Pryzby
Discussion: /messages/by-id/181A6DA8-3B7F-4B71-82D5-363FF0146820@yesql.se
Backpatch-through: 15

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ee5353abb6124de5ffd24ef1cedbc2a7196d4fd5

Modified Files
--------------
src/bin/pg_upgrade/check.c | 12 +++++++++---
src/bin/pg_upgrade/version.c | 12 +++++++++---
2 files changed, 18 insertions(+), 6 deletions(-)

#2Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#1)
pg_upgrade test failure

Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] ------------------------------------- 8< -------------------------------------
[20:23:04.171] stderr:
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
...

I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

On 2022-09-13 01:39:59 +0000, Michael Paquier wrote:

Move any remaining files generated by pg_upgrade into an internal subdir

This change concerns a couple of .txt files (for internal state checks)
that were still written in the path where the binary is executed, and
not in the subdirectory located in the target cluster. Like the other
.txt files that already do so (like loadable_libraries.txt), these are
saved in the base output directory. Note that on failure, the logs
report the full path to the .txt file generated, so these are easy to
find.

Oversight in 38bfae3.

Author: Daniel Gustafsson
Reviewed-by: Michael Paquier, Justin Pryzby
Discussion: /messages/by-id/181A6DA8-3B7F-4B71-82D5-363FF0146820@yesql.se
Backpatch-through: 15

Greetings,

Andres Freund

#3Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#2)
Re: pg_upgrade test failure

On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:

I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl
for SERVER_LOG_FILE) where the code still has handles on some of the
files in the log/ subdirectory, preventing its removal from
completing. If this proves easy to reproduce, a list of the files
still present in this path would give an easy hint to follow. Does
this reproduce with any regularity?
--
Michael

#4Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#3)
Re: pg_upgrade test failure

Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:

On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:

I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl
for SERVER_LOG_FILE) where the code still has handles on some of the
files in the log/ subdirectory, preventing its removal from
completing.

I don't really see what'd race with what here? pg_upgrade has precise control
over what's happening here, no?

If this proves easy to reproduce, a list of the files
still present in this path would give an easy hint to follow. Does
this reproduce with any regularity?

I've only seen it once so far, but there haven't been many CI runs of the
meson branch since rebasing on top of the last changes to pg_upgrade.

Greetings,

Andres Freund

#5Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#4)
Re: pg_upgrade test failure

On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:

I don't really see what'd race with what here? pg_upgrade has precise control
over what's happening here, no?

A code path could have forgotten an fclose() for example, but this code
is rather old and close-proof as far as I know. Most of the log files
are used with redirections for external calls, though I don't see
how these could still be held open after pg_upgrade finishes :/
Could the use of meson somehow influence how the tests run on Windows?

I've only seen it once so far, but there haven't been many CI runs of the
meson branch since rebasing on top of the last changes to pg_upgrade.

Hmm, okay. Is that a specific branch in one of your public repos?
--
Michael

#6Justin Pryzby
pryzby@telsasoft.com
In reply to: Andres Freund (#2)
Re: pg_upgrade test failure

On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:

Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] ------------------------------------- 8< -------------------------------------
[20:23:04.171] stderr:
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
...

It looks like it failed to remove a *.log file under windows, which
caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK "pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE "pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE "pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE "pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors so can't be the cause,
but the original commit 38bfae3 might be related.

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.

--
Justin

#7Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Justin Pryzby (#6)
Re: pg_upgrade test failure

On Tue, Sep 20, 2022 at 7:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:

Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] ------------------------------------- 8< -------------------------------------
[20:23:04.171] stderr:
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
...

Just for the record, the same issue was also seen here [1], [2].

[1]: https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
[2]: https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#6)
Re: pg_upgrade test failure

On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.

Yeah, I think this is my fault. In commit f357233c the new lstat()
call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
enter pgunlink()'s 10 second sleep-retry loop. Let me think about how
best to fix that, and how to write a regression test program that
would exercise stuff like this. Might take a couple of days as I am
away from computers until mid-week.

#10Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#9)
Re: pg_upgrade test failure

On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.

Yeah, I think this is my fault. In commit f357233c the new lstat()
call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
enter pgunlink()'s 10 second sleep-retry loop. Let me think about how
best to fix that, and how to write a regression test program that
would exercise stuff like this. Might take a couple of days as I am
away from computers until mid-week.

I think something like the attached should do the right thing for
STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
back to being blocking (sleep+retry until eventually we reach ENOENT
or we time out and give up with EACCES), but we still distinguish it
from true ENOENT so we have a fast exit in that case. This is passing
CI, but not tested yet.

One ugly thing in this patch is that it has to deal with our
historical mistake (?) of including Windows headers in this file in
Cygwin builds for no reason and thus getting WIN32 defined on a
non-WIN32 build, as I've complained about before [1] but not yet tidied
up.

Remembering why we do any of this weird looking stuff that we don't
need on Unix, the general idea is that things that scan directories to
unlink everything before unlinking the parent directory need to block
for a while on STATUS_DELETE_PENDING to increase their probability of
success, while things that do anything else probably just want to skip
such zombie files completely. To recap, we have:

* readdir() sees files that are ENOENT-in-progress (so recursive
unlinks can see them)
* unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
* stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix
eg pg_basebackup, which used to fail at random on Windows)
* open() reports ENOENT-in-progress as either ENOENT or EEXIST
depending on O_CREAT (because used by stat())

Clearly this set of kludges isn't perfect and other kludge-sets would
be possible too. One thought is that we could hide ENOENT-in-progress
from readdir(), and add a new rmdir() wrapper instead. If it gets a
directory-not-empty error from the kernel, it could at that point wait
for zombie files to go away (perhaps registering for file system
events with some local equivalent of KQ_FILTER_VNODE if there is one,
to be less sloppy than the current sleep() nonsense, but sleep would
work too).

When I'm back at my real keyboard I'll try to come up with tests for
this stuff, but I'm not sure how solid we can really make a test for
this particular case -- I think you'd need to have another thread open
the file and then close it after different periods of time, to
demonstrate that the retry loop works but also gives up, and that's
exactly the sort of timing-dependent stuff we try to avoid. But I
think I'll try that anyway, because it's essential infrastructure to
allow Unix-only hackers to work on this stuff. Once we have that,
we might be able to make some more progress with the various
FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll
have reproducible evidence for what it really does.

[1]: https://commitfest.postgresql.org/39/3781/

Attachments:

0001-Fix-STATUS_DELETE_PENDING-handling-in-pgunlink.patch (application/x-patch, +36 -3)
#11Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#10)
Re: pg_upgrade test failure

On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:

I think something like the attached should do the right thing for
STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
back to being blocking (sleep+retry until eventually we reach ENOENT
or we time out and give up with EACCES), but we still distinguish it
from true ENOENT so we have a fast exit in that case. This is passing
CI, but not tested yet.

    if (lstat(path, &st) < 0)
-       return -1;
+   {
+       if (lstat_error_was_status_delete_pending())
+           is_lnk = false;
+       else
+           return -1;
+   }
+   else
+       is_lnk = S_ISLNK(st.st_mode);
Sorry, I don't remember all the details in this area, but a directory
can never be marked as STATUS_DELETE_PENDING with some of its contents
still inside, right?  If it has some contents, forcing unlink() all
the time would be fine?

One ugly thing in this patch is that it has to deal with our
historical mistake (?) of including Windows headers in this file in
Cygwin builds for no reason and thus getting WIN32 defined on a
non-WIN32 build, as I've complained about before[1] but not yet tidied
up.

Your proposal remains local to dirmod.c, so that does not sound like a
big deal to me for the time being.
--
Michael

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#11)
Re: pg_upgrade test failure

On Mon, Oct 3, 2022 at 1:40 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:

I think something like the attached should do the right thing for
STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
back to being blocking (sleep+retry until eventually we reach ENOENT
or we time out and give up with EACCES), but we still distinguish it
from true ENOENT so we have a fast exit in that case. This is passing
CI, but not tested yet.

if (lstat(path, &st) < 0)
-       return -1;
+   {
+       if (lstat_error_was_status_delete_pending())
+           is_lnk = false;
+       else
+           return -1;
+   }
+   else
+       is_lnk = S_ISLNK(st.st_mode);

Sorry, I don't remember all the details in this area, but a directory
can never be marked as STATUS_DELETE_PENDING with some of its contents
still inside, right?

That's my understanding, yes: just like Unix, you can't remove a
directory with something in it. Unlike Unix, that includes files that
have been unlinked but are still open somewhere. (Note that in this
case it's not exactly a real directory, it's a junction point, which
is a directory but has no contents; it is a reparse point
pointing somewhere else, so I suspect that it can't really suffer from
ENOTEMPTY, but it probably can suffer from 'someone has it open for a
short time because they are concurrently stat-ing it'.)

If it has some contents, forcing unlink() all
the time would be fine?

Here's why I think it's probably OK to use unlink() unconditionally
after detecting STATUS_DELETE_PENDING. AFAICT there is no way to even
find out if it's a file or a junction in this state, but it doesn't
matter: we are not waiting for rmdir() or unlink() to succeed, we are
waiting for it to fail with an error other than EACCES, most likely
ENOENT (or to time out, perhaps because someone held the file open for
11 seconds, or because EACCES was actually telling us about a
permissions problem). EACCES is the errno for many things including
STATUS_DELETE_PENDING and also "you called unlink() but it's a
directory" (should be EPERM according to POSIX, or EISDIR according
to Linux). Both of those reasons imply that the zombie directory
entry still exists, and we don't care which of those reasons triggered
it. So I think that setting is_lnk = false is good enough here. Do
you see a hole in it?

#13Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#12)
Re: pg_upgrade test failure

On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:

So I think that setting is_lnk = false is good enough here. Do
you see a hole in it?

I cannot think of one, off the top of my head. Thanks for the
explanation.
--
Michael

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#13)
Re: pg_upgrade test failure

On Mon, Oct 3, 2022 at 7:29 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:

So I think that setting is_lnk = false is good enough here. Do
you see a hole in it?

I cannot think of one, off the top of my head. Thanks for the
explanation.

Some things I learned while trying to understand how I managed to
introduce that bug despite reading and testing:

* the code in pgunlink() has comments saying that its retry loop is to
handle sharing violations
* in fact that retry loop also comes into play for STATUS_DELETE_PENDING
* that case is fairly well hidden, because to reach it you need to
unlink(pathname) twice! the second call will wait up to 10 seconds
for handles to close and then report ENOENT, allowing rmdir(parent) to
succeed
* I guess this code is relying on that double-unlink to block until
the directory is empty?
* you wouldn't notice any of this if you were testing on Windows 10 on
a desktop/laptop/VM, because it now uses POSIX semantics for unlink on
NTFS, so the first unlink truly (synchronously) unlinks (no more
STATUS_DELETE_PENDING)
* Server 2019, as used on CI, still uses the traditional NT semantics
(unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests
to demonstrate)

I'll post my tests for this and a bunch more things I figured out
shortly in a new Windows-filesystem-semantics-omnibus thread.

#15Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#14)
Re: pg_upgrade test failure

On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:

* Server 2019, as used on CI, still uses the traditional NT semantics
(unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests
to demonstrate)

Wow, nice investigation. And cirrus does not offer a newer option
either... Do you think that Windows Server 2022 (the successor of 2019)
is able to use POSIX semantics for unlink()? It looks like we are a few
years away from being able to do that, assuming that cirrus offers a
newer version than Server 2019, but I guess the code could
mention this possibility in a comment, at least.
--
Michael

#16Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#15)
Re: pg_upgrade test failure

On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:

On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:

* Server 2019, as used on CI, still uses the traditional NT semantics
(unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests
to demonstrate)

Wow, nice investigation. And cirrus does not offer a newer option
either..

Currently Andres builds images based on cirrus's 2019 image, but I think
we could use any windows docker image.

Do you think that Windows server 2022 (successor of 2019) is
able to use POSIX semantics for unlink()?

I think it's possible to use it now, like what's done here.
https://commitfest.postgresql.org/40/3347/

The only caveat is that it's done conditionally.

#17Andres Freund
andres@anarazel.de
In reply to: Justin Pryzby (#16)
Re: pg_upgrade test failure

Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:

On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:

On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:

* Server 2019, as used on CI, still uses the traditional NT semantics
(unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests
to demonstrate)

Wow, nice investigation. And cirrus does not offer a newer option
either..

Currently Andres builds images based on cirrus's 2019 image, but I think
we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just use
older ones. And if you use mismatching containers the startup gets slower
because it switches to use full virtualization rather than containers.

I think we need to switch to use full VMs rather than containers. The
performance of the windows containers is just atrocious (build times on a
local VM with the same number of cores is 1/2 of what we see in CI, test times
1/3), they're slow to start due to pulling all files and decompressing them,
and they're fragile. I've asked Bilal (CCed) to work on generating both
containers and images.

Greetings,

Andres Freund

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#17)
Re: pg_upgrade test failure

So [1] on its own didn't fix this. My next guess is that the attached
might help.

Hmm. Following Michael's clue that this might involve log files and
pg_ctl, I noticed one thing: pg_ctl implements
wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
our kill emulation does CallNamedPipe(). If the server is in the
process of exiting and the kernel is cleaning up all the handles we
didn't close, is there any reason to expect the signal pipe to be
closed after the log file?

[1]: /messages/by-id/20221025213055.GA8537@telsasoft.com

Attachments:

0001-Refactor-rmtree-to-use-get_dirent_type.patch (text/x-patch, +37 -58)
#19Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#18)
Re: pg_upgrade test failure

On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:

So [1] on its own didn't fix this. My next guess is that the attached
might help.

I took the liberty of adding a CF entry for this
https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the
tests using meson test --repeat, rather than let cfbot do it over the
course of a month.
https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this
also doesn't prove that it will work).

--
Justin

#20Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#18)
Re: pg_upgrade test failure

Hi,

On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:

So [1] on its own didn't fix this. My next guess is that the attached
might help.

Hmm. Following Michael's clue that this might involve log files and
pg_ctl, I noticed one thing: pg_ctl implements
wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
our kill emulation does CallNamedPipe(). If the server is in the
process of exiting and the kernel is cleaning up all the handles we
didn't close, is there any reason to expect the signal pipe to be
closed after the log file?

What is our plan here? This afaict is the most common "false positive" for
cfbot in the last weeks.

E.g.:

https://api.cirrus-ci.com/v1/artifact/task/5462686092230656/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
...
[00:02:58.761](93.859s) ok 10 - run of pg_upgrade for new instance
[00:02:58.808](0.047s) not ok 11 - pg_upgrade_output.d/ removed after pg_upgrade success
[00:02:58.815](0.007s) # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
# at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 288.

Michael:

Why does 002_pg_upgrade.pl try to filter the list of files in
pg_upgrade_output.d for files ending in .log? And why does it print those
only after starting the new node?

How about moving the iteration through the pg_upgrade_output.d to before the
->start and printing all the files, but only slurp_file() if the filename ends
with *.log?

Minor nit: it seems odd to have quite so many copies of
$newnode->data_dir . "/pg_upgrade_output.d"
particularly when the test defines $log_path but then still builds
it from scratch afterwards (line 304).

Greetings,

Andres Freund

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#20)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#21)
#23Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#22)
#24Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#24)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#25)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
#28Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#26)
#29Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#28)
#30Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#29)
#31Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#30)
#32Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#33)
#35Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Andres Freund (#34)
#36vignesh C
vignesh21@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#35)