Cygwin cleanup

Started by Thomas Munroalmost 4 years ago38 messageshackers

thomas.munro@gmail.com

almost 4 years ago

Hi,

Continuing a discussion started over at [1]/messages/by-id/CA+hUKGKZ_FjkBnjGADk+pa2g4oKDcG8=SE5V23sPTP0EELfyzQ@mail.gmail.com. Moving this to a new
thread so that other thread can focus on Unix cleanup, and both
threads can get CI coverage...

1. In a few places, it is alleged that both __CYGWIN__ and WIN32
might be defined at the same time. Do you think we should try to get
rid of that possibility? I understand that we have to have a few
tests for __CYGWIN__ here and there, because eg file permissions don't
work quite right and there's not much we can do about that. But it
seems a bit unhelpful if we also have to worry about a more-or-less
POSIX-ish build taking WIN32 paths at uncertain times if we forget to
defend against that, or wonder why some places are not consistent.

A quick recap of the three flavours of Windows platform we have to
handle, as I understand it:

* MSVC: Windowsy toolchain, Windowsy C
* custom perl scripts instead of configure
* msbuild instead of make
* MSVC compiler
* Windows C APIs
* we provide our own emulation of some POSIX C APIs on top
* MSYS: Unixy toolchain, Windowsy C
* configure (portname = "win32")
* make
* GCC compiler
* Windows C APIs
* we provide our own emulation of some POSIX C APIs on top
* Cygwin: Unixy toolchain, Unixy C
* configure (portname = "cygwin")
* make
* GCC compiler
* POSIX C APIs (emulations provided by the Cygwin runtime libraries)

(The configure/make part will be harmonised by the Meson project.)

The macro WIN32 is visibly defined by something in/near msbuild in
MSVC builds: /D WIN32 is right here in the build transcripts (whereas
the compiler defines _WIN32; good compiler). I am not sure how
exactly it is first defined in MSYS builds; I suspect that MSYS gcc
might define it itself, but I don't have access to MSYS to check. As
for Cygwin, the only translation unit where I could find both
__CYGWIN__ and WIN32 defined is dirmod.c, and that results from
including <windows.h> and ultimately <minwindef.h> (even though WIN32
isn't defined yet at that time). I couldn't understand why we do
that, but I probably didn't read enough commit history. The purpose
of dirmod.c on Cygwin today is only to wrap otherwise pure POSIX code
in retry loops to handle those spurious EACCES errors due to NT
sharing violations, so there is no need for that.

Proposal: let's make it a programming rule that we don't allow
definitions from Windows headers to leak into Cygwin translation
units, preferably by never including them, or if we really must, let's
grant specific exemptions in an isolated and well documented way. We
don't seem to need any such exemptions currently. Places where we
currently worry about the contradictory macros could become
conditional #error directives instead.

2. To make it possible to test any of that, you either need a working
Windows+Cygwin setup, or working CI. I'm a salty old Unix hacker so I
opted for the latter, and I also hope this will eventually be useful
to others. Unfortunately I haven't figured out how to get everything
working yet, so some of the check-world tests are failing. Clues most
welcome!

The version I'm posting here is set to run always, so that cfbot will
show it alongside others. But I would imagine that if we got a
committable-quality version of this, it'd probably be opt-in, so you'd
have to say "ci-os-only: cygwin", or "ci-os-only: cygwin, windows" etc
in a commit to your private github account to ask for it (or maybe
we'll come up with a way to tell cfbot we want the full works of CI
checks; the same decision will come up for MSYS, OpenBSD and NetBSD CI
support that my colleague is working on). There are other things to
fix too, including abysmal performance; see commit message.

3. You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs. That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same... I even wondered about putting our own magic entry/exit
macros into signal handlers, that would use atomics to implement a
second level of signal masking (?!) but that's an uncommonly large
bandaid for a defective platform... and trying to fix Cygwin itself
is a rabbithole too far for me.

4. When building with Cygwin GCC 11.3 you get a bunch of warnings
that don't show up on other platforms, seemingly indicating that it
interprets -Wimplicit-fallthrough=3 differently. Huh?

[1]: /messages/by-id/CA+hUKGKZ_FjkBnjGADk+pa2g4oKDcG8=SE5V23sPTP0EELfyzQ@mail.gmail.com

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Thomas Munro (#1)

Re: Cygwin cleanup

Thomas Munro <thomas.munro@gmail.com> writes:

3. You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs. That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same...

If that's an accurate statement, shouldn't we just drop Cygwin support?
Now that we have a native Windows build, it's hard to see how any live
user would prefer to use the Cygwin build.

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Tom Lane (#2)

Re: Cygwin cleanup

On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

3. You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs. That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same...

If that's an accurate statement, shouldn't we just drop Cygwin support?

This thread rejected the idea last time around:

/messages/by-id/136712b0-0619-5619-4634-0f0286acaef7@2ndQuadrant.com

lorikeet still shows the issue. Failures often involve assertions
about PMSignalState or mq->mq_sender. Hmm, it's running Cygwin 3.2.0
(March 2021) and the latest release is 3.3.5, so it's remotely
possible that it's been fixed recently. Maybe that'd be somewhere in
here, but it's not jumping out:

https://github.com/cygwin/cygwin/commits/master/winsup/cygwin/signal.cc

(Oooh, another implementation of signalfd...)

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Thomas Munro (#3)

Re: Cygwin cleanup

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If that's an accurate statement, shouldn't we just drop Cygwin support?

This thread rejected the idea last time around:
/messages/by-id/136712b0-0619-5619-4634-0f0286acaef7@2ndQuadrant.com

I think maybe we should re-open the discussion. I've certainly
reached the stage of fed-up-ness. That platform seems seriously
broken, upstream is making no progress on fixing it, and there
doesn't seem to be any real-world use-case. The only positive
argument for it is that Readline doesn't work in the other
Windows builds --- but we've apparently not rechecked that
statement in eighteen years, so maybe things are better now.

If we could just continue to blithely ignore lorikeet's failures,
I wouldn't mind so much; but doing any significant amount of new
code development work for the platform seems like throwing away
developer time.

regards, tom lane

Robert Haas

robertmhaas@gmail.com

almost 4 years ago

In reply to: Tom Lane (#4)

Re: Cygwin cleanup

On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think maybe we should re-open the discussion. I've certainly
reached the stage of fed-up-ness. That platform seems seriously
broken, upstream is making no progress on fixing it, and there
doesn't seem to be any real-world use-case. The only positive
argument for it is that Readline doesn't work in the other
Windows builds --- but we've apparently not rechecked that
statement in eighteen years, so maybe things are better now.

If we could just continue to blithely ignore lorikeet's failures,
I wouldn't mind so much; but doing any significant amount of new
code development work for the platform seems like throwing away
developer time.

I agree with that. All things being equal, I like the idea of
supporting a bunch of different platforms, and Cygwin doesn't really
look that dead. It has recent releases. But if blocking signals
doesn't actually work on that platform, making PostgreSQL work
reliably there seems really difficult.

--
Robert Haas
EDB: http://www.enterprisedb.com

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Thomas Munro (#1)

Re: Cygwin cleanup

On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote:

3. You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs. That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same...

Reference: /messages/by-id/20170321034703.GB2097809@tornado.leadboat.com

On my 2nd try:

https://cirrus-ci.com/task/5311911574110208
TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370)
2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted

XXX Doesn't get all the way through yet...

Mainly because getopt was causing all tap tests to fail.
I tried to fix that in configure, but ended up changing the callers.

This is getting close, but I don't think has actually managed to pass all tests
yet.. https://cirrus-ci.com/task/5274721116749824

4. When building with Cygwin GCC 11.3 you get a bunch of warnings
that don't show up on other platforms, seemingly indicating that it
interprets -Wimplicit-fallthrough=3 differently. Huh?

Evidently due to the same getopt issues.

XXX This should use a canned Docker image with all the right packages
installed

Has anyone tried using non-canned images ? It sounds like this could reduce
the 4min startup time for windows.

https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

XXX configure is soooo slooow, can we cache it?! Compiling is also
insanely slow, but ccache gets it down to a couple of minutes if you're
lucky

One reason compiling was slow is because you ended up with -O2.

You can cache configure as long as you're willing to re-run it whenever options
were changed. That also applies to the existing headerscheck.

XXX I don't know how to put variables like BUILD_JOBS into the scripts

WDYM ? If it's outside of bash and in windows shell it's like %var%, right ?
https://cirrus-ci.org/guide/writing-tasks/#environment-variables

I just noticed that cirrus is misbehaving: if there's a variable called CI
(which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}.
I've also seen weirdness when variable names or operators appear in the commit
message...

XXX Needs some --with-X options

Done

XXX We would never want this to run by default in CI, but it'd be nice
to be able to ask for it with ci-os-only! (See commented out line)
only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'

Doesn't this already do what's needed?
As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
the task will runs only on request.

XXX I have no idea if crash dump works, and if this should share
elements with the msys work in commitfest #3575

Based on the crash above, it wasn't working. And after some changes ... it
still doesn't work.

windows_os is probably skipping too many things.

--
Justin

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Justin Pryzby (#6)

Re: Cygwin cleanup

On Wed, Jul 27, 2022 at 6:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote:

3. You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs. That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same...

Reference: /messages/by-id/20170321034703.GB2097809@tornado.leadboat.com

On my 2nd try:

https://cirrus-ci.com/task/5311911574110208
TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370)
2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted

Thanks for working on this!

Huh, that Cygwin being shipped by Choco is quite old, older than
lorikeet's, but not old enough to not have the bug:

[04:33:55.234] Starting cygwin install, version 2.918

Based on clues in Noah's emails in the archives, I think versions from
maybe somewhere around 2015 didn't have the bug, and then the bug
appeared, and AFAIK it's still here. I wonder if you can tell Choco
to install an ancient version, but even if that's possible you'd be
dealing with other stupid problems and bugs.

XXX Doesn't get all the way through yet...

Mainly because getopt was causing all tap tests to fail.
I tried to fix that in configure, but ended up changing the callers.

This is getting close, but I don't think has actually managed to pass all tests
yet.. https://cirrus-ci.com/task/5274721116749824

Woo.

4. When building with Cygwin GCC 11.3 you get a bunch of warnings
that don't show up on other platforms, seemingly indicating that it
interprets -Wimplicit-fallthrough=3 differently. Huh?

Evidently due to the same getopt issues.

Ahh, nice detective work.

XXX This should use a canned Docker image with all the right packages
installed

Has anyone tried using non-canned images ? It sounds like this could reduce
the 4min startup time for windows.

https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

Yeah, I had that working once. Not sure what the pros and cons would be for us.

XXX configure is soooo slooow, can we cache it?! Compiling is also
insanely slow, but ccache gets it down to a couple of minutes if you're
lucky

One reason compiling was slow is because you ended up with -O2.

Ah, right.

You can cache configure as long as you're willing to re-run it whenever options
were changed. That also applies to the existing headerscheck.

XXX I don't know how to put variables like BUILD_JOBS into the scripts

WDYM ? If it's outside of bash and in windows shell it's like %var%, right ?
https://cirrus-ci.org/guide/writing-tasks/#environment-variables

Right. I should have taken the clue from the %cd% (I got a few ideas
about how to do this from libarchive's CI scripting[1]https://github.com/libarchive/libarchive/blob/master/build/ci/cirrus_ci/ci.cmd).

I just noticed that cirrus is misbehaving: if there's a variable called CI
(which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}.
I've also seen weirdness when variable names or operators appear in the commit
message...

XXX Needs some --with-X options

Done

Neat.

XXX We would never want this to run by default in CI, but it'd be nice
to be able to ask for it with ci-os-only! (See commented out line)
only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'

Doesn't this already do what's needed?
As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
the task will runs only on request.

Yeah I was just trying to say that I was sharing the script in a way
that always runs, but for commit we'd want that. This is all far too
slow for cfbot to have to deal with on every build. Looks like we can
expect to be able to build and test fast on Windows soonish, though,
so maybe one day we'd just turn Cygwin and MSYS on?

[1]: https://github.com/libarchive/libarchive/blob/master/build/ci/cirrus_ci/ci.cmd

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Thomas Munro (#7)

Re: Cygwin cleanup

On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:

Thanks for working on this!

Huh, that Cygwin being shipped by Choco is quite old, older than
lorikeet's, but not old enough to not have the bug:

[04:33:55.234] Starting cygwin install, version 2.918

Hm, I think that's the version of "cygwinsetup" but not cygwin..
It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved]

I wonder if you can tell Choco
to install an ancient version, but even if that's possible you'd be
dealing with other stupid problems and bugs.

Yes: choco install -y --no-progress --version 4.6.1 ccache

XXX This should use a canned Docker image with all the right packages
installed

Has anyone tried using non-canned images ? It sounds like this could reduce
the 4min startup time for windows.

https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

Yeah, I had that working once. Not sure what the pros and cons would be for us.

I think it could be a lot faster to start, since cirrus caches the generated
docker image locally. Rather than (I gather) pulling the image every time.

XXX We would never want this to run by default in CI, but it'd be nice
to be able to ask for it with ci-os-only! (See commented out line)
only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'

Doesn't this already do what's needed?
As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
the task will runs only on request.

Yeah I was just trying to say that I was sharing the script in a way
that always runs, but for commit we'd want that. This is all far too
slow for cfbot to have to deal with on every build.

It occurred to me today that if cfbot preserved the original patch series, and
commit messages, that would allow patch authors to write things like
"ci-os-only: docs" for a doc only patch. I've never gotten cirrus'
changesOnly() stuff to work...

Looks like we can expect to be able to build and test fast on Windows
soonish, though,

Do you mean with meson ?

so maybe one day we'd just turn Cygwin and MSYS on?

I didn't understand this ?

--
Justin

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Justin Pryzby (#8)

Re: Cygwin cleanup

On Fri, Jul 29, 2022 at 10:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:

[04:33:55.234] Starting cygwin install, version 2.918

Hm, I think that's the version of "cygwinsetup" but not cygwin..
It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved]

Oops. Ok so we're testing the very latest then, and it definitely
still has the bug as we thought.

It occurred to me today that if cfbot preserved the original patch series, and
commit messages, that would allow patch authors to write things like
"ci-os-only: docs" for a doc only patch. I've never gotten cirrus'
changesOnly() stuff to work...

Maybe it's time to switch to "git am -3 ..." and reject patches that
don't apply that way.

Looks like we can expect to be able to build and test fast on Windows
soonish, though,

Do you mean with meson ?

Yeah. Also there are some other things we can do to speed up testing
on Windows (and elsewhere), like not running every test query with new
psql + backend process pair, which takes at least a few hundred ms and
sometimes up to several seconds on this platform; I have some patches
I need to finish...

so maybe one day we'd just turn Cygwin and MSYS on?

I didn't understand this ?

I mean, if, some sunny day, we can compile and test on Windows at
non-glacial speeds, then it would become possible to contemplate
having cfbot run these tasks for every patch every time.

#10

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Robert Haas (#5)

Re: Cygwin cleanup

On Wed, Jul 27, 2022 at 5:09 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think maybe we should re-open the discussion. I've certainly
reached the stage of fed-up-ness. That platform seems seriously
broken, upstream is making no progress on fixing it, and there
doesn't seem to be any real-world use-case. The only positive
argument for it is that Readline doesn't work in the other
Windows builds --- but we've apparently not rechecked that
statement in eighteen years, so maybe things are better now.

If we could just continue to blithely ignore lorikeet's failures,
I wouldn't mind so much; but doing any significant amount of new
code development work for the platform seems like throwing away
developer time.

I agree with that. All things being equal, I like the idea of
supporting a bunch of different platforms, and Cygwin doesn't really
look that dead. It has recent releases. But if blocking signals
doesn't actually work on that platform, making PostgreSQL work
reliably there seems really difficult.

It's one thing to drop old dead Unixes but I don't think anyone would
enjoy dropping support for an active open source project. The best
outcome would be for people who have an interest in seeing PostgreSQL
work correctly on Cygwin to help get the bug fixed. Here are the
threads I'm aware of:

https://cygwin.com/pipermail/cygwin/2017-August/234001.html
https://cygwin.com/pipermail/cygwin/2017-August/234097.html

I wonder if these problems would go away as a nice incidental
side-effect if we used latches for postmaster wakeups. I don't
know... maybe, if the problem is just with the postmaster's pattern of
blocking/unblocking? Maybe backend startup is simple enough that it
doesn't hit the bug? From a quick glance, I think the assertion
failures that occur in regular backends can possibly be blamed on the
postmaster getting confused about its children due to unexpected
handler re-entry.

#11

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Thomas Munro (#7)

Re: Cygwin cleanup

On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:

XXX We would never want this to run by default in CI, but it'd be nice
to be able to ask for it with ci-os-only! (See commented out line)
only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'

Doesn't this already do what's needed?
As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
the task will runs only on request.

Yeah I was just trying to say that I was sharing the script in a way
that always runs, but for commit we'd want that.

That makes more sense after noticing that you created a cf entry (for which
cfbot has been skipping my patch due to my "only_if" line). There's still a
few persistent issues:

This fails ~50% of the time in recovery 010-truncate
I hacked around this by setting data_sync_retry.
https://cirrus-ci.com/task/5289444063313920
I found these, not sure if they're relevant.
/messages/by-id/CAA4eK1Kft05mwNuZbTVRmz8SNS3r+uriuCT8DxL5KJy5btoS-A@mail.gmail.com
/messages/by-id/CAFiTN-uGxgo5258hZy2QJoz=s7_Cs7v9=b8Z2GgFV7qmQUOwxw@mail.gmail.com

And an fsync abort in 013 which seems similar to this other one.
data_sync_retry also avoids this issue.
https://cirrus-ci.com/task/6283023745286144?logs=cores#L34
/messages/by-id/CAMVYW_4QhjZ-19Xpr2x1B19soRCNu1BXHM8g1mOnAVtd5VViDw@mail.gmail.com

And sometimes various assertions failing in regress parallel_select (and then times out)
https://api.cirrus-ci.com/v1/artifact/task/5537540282253312/log/src/test/regress/log/postmaster.log
https://api.cirrus-ci.com/v1/artifact/task/6108746773430272/log/src/test/regress/log/postmaster.log
Or "could not map dynamic shared memory segment" (actually in 027-stream-regress):
https://cirrus-ci.com/task/6168860746317824

And segfault in vacuum parallel
https://api.cirrus-ci.com/v1/artifact/task/5404589569605632/log/src/test/regress/log/postmaster.log

Sometimes semctl() failed: Resource temporarily unavailable
https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/test/subscription/tmp_check/log/014_binary_publisher.log
https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/bin/pg_rewind/tmp_check/log/001_basic_standby_local.log

Some more
https://cirrus-ci.com/task/6468927780814848

If you're lucky, there's only 1 or 2 problems, of which those are different
symptoms.. Maybe for now this needs to disable tap tests :(

This shows that it *can* pass, if slowly, and infrequently:
https://cirrus-ci.com/task/6546858536337408

This fixes my changes to configure for getopt.
And simplifies the changes to *.pl (the .exe changes weren't necessary at all).
And removes the changes for implicit-fallthrough; I realized that configure was
just deciding that it didn't work and not using it at all.
And adds support for backtraces.
And remove kerberos and and add libxml

Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus
handles that automatically.

--
Justin

#12

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Justin Pryzby (#8)

Re: Cygwin cleanup

Hi,

On 2022-07-28 17:23:19 -0500, Justin Pryzby wrote:

On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:

XXX This should use a canned Docker image with all the right packages
installed

Has anyone tried using non-canned images ? It sounds like this could reduce
the 4min startup time for windows.

https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

Yeah, I had that working once. Not sure what the pros and cons would be for us.

I think it could be a lot faster to start, since cirrus caches the generated
docker image locally. Rather than (I gather) pulling the image every time.

I'm quite certain that is not true. All the docker images built are just
uploaded to the google container registry and then downloaded onto a
*separate* windows host. The dockerfile: stuff generates a separate task
running on a separate machine...

It's a bit better for non-windows containers, because there google has some
optimization for pulling image (pieces) on demand or such.

Greetings,

Andres Freund

#13

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Justin Pryzby (#11)

Re: Cygwin cleanup

On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

[train wreck]

Oh my, so I'm getting the impression we might actually be totally
unstable on Cygwin. Which surprises me because ... wait a minute ...
lorikeet isn't even running most of the tests. So... we don't really
know the degree to which any of this works at all?

This shows that it *can* pass, if slowly, and infrequently:
https://cirrus-ci.com/task/6546858536337408

Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm
one step closer to what Tom said, re wasting developer time...

[lots of improvements]

Cool.

Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus
handles that automatically.

Cargo-culted from libarchive.

#14

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Thomas Munro (#13)

Re: Cygwin cleanup

Hi,

On 2022-08-04 16:16:06 +1200, Thomas Munro wrote:

Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm
one step closer to what Tom said, re wasting developer time...

It might be worth checking whether the cygwin installer, which at some point
at least allowed installing postgres, has download numbers available anywhere.

It's possible we could e.g. get away with just allowing libpq to be built.

Greetings,

Andres Freund

#15

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Thomas Munro (#13)

Re: Cygwin cleanup

On Thu, Aug 4, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

[train wreck]

Oh my, so I'm getting the impression we might actually be totally
unstable on Cygwin. Which surprises me because ... wait a minute ...
lorikeet isn't even running most of the tests. So... we don't really
know the degree to which any of this works at all?

Hmm, it's possible that all these failures are just new-to-me effects
of the known bug. Certainly the assertion failures are of the usual
type, and I think it might be possible for the weird parallel query
failure to be explained by the postmaster forking extra phantom child
processes.

It may be madness to try to work around this, but I wonder if we could
use a static local variable that we update with atomic compare
exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
On entry, if you can do 0->1 it means you are allowed to run the
function. If it's non-zero, set n->n+1 and return immediately: signal
blocked, but queued for later. On exit, you CAS n->0. If n was > 1,
then you have to jump back to the top and run the function body again.

#16

Tom Lane

tgl@sss.pgh.pa.us

almost 4 years ago

In reply to: Thomas Munro (#15)

Re: Cygwin cleanup

Thomas Munro <thomas.munro@gmail.com> writes:

It may be madness to try to work around this, but I wonder if we could
use a static local variable that we update with atomic compare
exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
On entry, if you can do 0->1 it means you are allowed to run the
function. If it's non-zero, set n->n+1 and return immediately: signal
blocked, but queued for later. On exit, you CAS n->0. If n was > 1,
then you have to jump back to the top and run the function body again.

And ... we're expending all this effort for what exactly?

regards, tom lane

#17

Thomas Munro

thomas.munro@gmail.com

almost 4 years ago

In reply to: Tom Lane (#16)

Re: Cygwin cleanup

On Thu, Aug 4, 2022 at 5:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

It may be madness to try to work around this, but I wonder if we could
use a static local variable that we update with atomic compare
exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
On entry, if you can do 0->1 it means you are allowed to run the
function. If it's non-zero, set n->n+1 and return immediately: signal
blocked, but queued for later. On exit, you CAS n->0. If n was > 1,
then you have to jump back to the top and run the function body again.

And ... we're expending all this effort for what exactly?

I'd be almost as happy if we ripped it all out, shut down lorikeet and
added it to the list of fallen platforms. I'd feel a bit like a
vandal, though. My suggestion is a last-ditch idea for Noah (CCd)
and/or Andrew to consider, who (respectively) blocked this last time
and run lorikeet. No plans to write that patch myself...

#18

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Thomas Munro (#13)

Re: Cygwin cleanup

On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:

On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

[train wreck]

Oh my, so I'm getting the impression we might actually be totally
unstable on Cygwin. Which surprises me because ... wait a minute ...
lorikeet isn't even running most of the tests. So... we don't really
know the degree to which any of this works at all?

Right.

Maybe it's of limited interest, but ..

This updates the patch to build and test with meson.
Which first requires patching some meson.builds.
I guess that's needed for some current BF members, too.
Unfortunately, ccache+PCH causes gcc to crash :(

--
Justin

#19

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Justin Pryzby (#18)

Re: Cygwin cleanup

On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote:

On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:

On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

[train wreck]

Oh my, so I'm getting the impression we might actually be totally
unstable on Cygwin. Which surprises me because ... wait a minute ...
lorikeet isn't even running most of the tests. So... we don't really
know the degree to which any of this works at all?

Right.

Maybe it's of limited interest, but ..

This updates the patch to build and test with meson.
Which first requires patching some meson.builds.
I guess that's needed for some current BF members, too.
Unfortunately, ccache+PCH causes gcc to crash :(

Resending with the 'only-if' line commented (doh).
And some fixes to 001 as Andres pointed out by on other thread.

--
Justin

#20

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Justin Pryzby (#19)

Re: Cygwin cleanup

Hi,

On 2022-11-08 19:04:37 -0600, Justin Pryzby wrote:

From 2741472080eceac5cb6d002c39eaf319d7f72b50 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Fri, 30 Sep 2022 13:39:43 -0500
Subject: [PATCH 1/3] meson: other fixes for cygwin

XXX: what about HAVE_BUGGY_STRTOF ?

What about it? As noted in another thread, HAVE_BUGGY_STRTOF is defined in a
header, and shouldn't be affected by the buildsystem.

Pushed this commit.

XXX This should use a canned Docker image with all the right packages
installed? But if the larger image is slower to start, then maybe not...

I think once we convert the windows containers to windows VMs we can just
install both cygwin and mingw in the same image. The overhead of installing
too much seems far far smaller there.

+    CONFIGURE_FLAGS: --enable-cassert --enable-debug --with-ldap --with-ssl=openssl --with-libxml
+    # --enable-tap-tests

I assume this is disabled as tap tests fail?

+ C:\tools\cygwin\bin\bash.exe --login -c "cygserver-config -y"

I'd copy the approach used for mingw of putting most of this in an environment
variable.

+findargs=''
case $os in
freebsd|linux|macos)
-    ;;
+        ;;
+
+    cygwin)
+        # XXX Evidently I don't know how to write two arguments here without pathname expansion later, other than eval.
+        #findargs='-name "*.stackdump"'
+        for stack in $(find "$directory" -type f -name "*.stackdump") ; do
+            binary=`basename "$stack" .stackdump`
+            echo;echo;
+            echo "dumping ${stack} for ${binary}"
+            awk '/^0/{print $2}' $stack |addr2line -f -i -e ./src/backend/postgres.exe
+            #awk '/^0/{print $2}' $stack |addr2line -f -i -e "./src/backend/$binary.exe"
+        done
+        exit 0
+        ;;

Is this stuff actually needed? Could we use the infrastructure we use for
backtraces with msvc instead? Or use something that understands .stackdump
files?

+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
[...]
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
[...]
+++ b/src/test/recovery/t/020_archive_status.pl
[...]

I think these should be in a separate commit, they're not actually about CI.

Greetings,

Andres Freund

#21

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#10)

#22