Windows CFBot is broken because ecpg dec_test.c error

Started by Jelte Fennema-Nio, 12 months ago, 17 messages
#1 Jelte Fennema-Nio
postgres@jeltef.nl

For about 11 hours now, the ecpg test has been consistently failing on
Windows with this error [1]:

Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear
winner. My current guess is that there's some dependency rule missing
in the meson file and due to some infra changes files now get compiled
in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit,
so it seems to be a red herring. (CC-ed Noah anyway)

[1]: https://cirrus-ci.com/task/6305422665056256?logs=check_world#L143

#2 Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#1)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

For about 11 hours now, the ecpg test has been consistently failing on
Windows with this error [1]:

Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear
winner. My current guess is that there's some dependency rule missing
in the meson file and due to some infra changes files now get compiled
in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit,
so it seems to be a red herring. (CC-ed Noah anyway)

I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

Andres


#3 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#2)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:

Hi,

On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

For about 11 hours now, the ecpg test has been consistently failing on
Windows with this error [1]:

Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear
winner. My current guess is that there's some dependency rule missing
in the meson file and due to some infra changes files now get compiled
in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit,
so it seems to be a red herring. (CC-ed Noah anyway)

I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

The cause is that meson fixed a bug [1] in v1.7.0. Before v1.7.0,
meson built all targets even though --no-rebuild was used while
running tests. That is fixed in v1.7.0.

The change below fixes the problem:

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..c7a94ff6471 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -17,7 +17,7 @@ env:
   CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
-  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
+  MTEST_ARGS: --print-errorlogs -C build
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance

And I think this is the correct approach. It builds all of the
not-yet-built targets before running the tests. Another solution might
be manually building ecpg target before running tests but I think the
former approach is more suitable for the CI.

CI run after this change applied: https://cirrus-ci.com/build/6264369203380224

[1]: https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

--
Regards,
Nazir Bilal Yavuz
Microsoft

#4 Andres Freund
andres@anarazel.de
In reply to: Nazir Bilal Yavuz (#3)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:

On Tue, 28 Jan 2025 at 17:02, Andres Freund <andres@anarazel.de> wrote:

Hi,

On January 28, 2025 7:13:16 AM EST, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

For about 11 hours now, the ecpg test has been consistently failing on
Windows with this error [1]:

Could not open file C:/cirrus/build/src/interfaces/ecpg/test/compat_informix/dec_test.c for reading

I took a quick look at possible causes but couldn't find a clear
winner. My current guess is that there's some dependency rule missing
in the meson file and due to some infra changes files now get compiled
in the wrong order.

One recent suspicious commit seems to be:
7819a25cd101b574f5422edb00fe3376fbb646da
But there are a bunch of successful changes that include that commit,
so it seems to be a red herring. (CC-ed Noah anyway)

I think it's due to a new version of meson. It seems we underspecified the test dependencies. I'll write up a patch.

Sorry, got distracted with somewhat pressing matters.

The cause is that meson fixed a bug [1] in v1.7.0. Before v1.7.0,
meson built all targets even though --no-rebuild was used while
running tests. That is fixed in v1.7.0.

[1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

That's not *quite* right - it wasn't that the targets were built when
--no-rebuild was specified, it's that the default build target (for just
'ninja') built all test dependencies.
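
To make that distinction concrete, here is a toy model in Python. It is not
meson's implementation: the Target class, default_build_set() and the target
names are made up for illustration, and treating dec_test.c as a generated
file that is only a declared test dependency (not built by default) is an
assumption about how the ecpg tests are wired up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    build_by_default: bool = True


def default_build_set(targets, test_deps, meson_version):
    """Return the names a plain 'ninja' would build in this toy model."""
    built = {t.name for t in targets if t.build_by_default}
    # Modelled behaviour: before 1.7.0 the default target also pulled in
    # every declared test dependency; from 1.7.0 on it does not.
    if meson_version < (1, 7, 0):
        for deps in test_deps.values():
            built |= set(deps)
    return built


# Hypothetical targets: one regular binary, one test-only generated file.
targets = [Target("postgres"), Target("dec_test.c", build_by_default=False)]
test_deps = {"ecpg": {"dec_test.c"}}

old = default_build_set(targets, test_deps, (1, 6, 1))
new = default_build_set(targets, test_deps, (1, 7, 0))
print("dec_test.c" in old, "dec_test.c" in new)  # prints: True False
```

With --no-rebuild, 'meson test' runs against whatever the default target left
behind, so in the second case the ecpg test would find no dec_test.c to open.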

The change below fixes the problem:

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..c7a94ff6471 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -17,7 +17,7 @@ env:
   CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
-  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
+  MTEST_ARGS: --print-errorlogs -C build
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance

And I think this is the correct approach. It builds all of the
not-yet-built targets before running the tests. Another solution might
be manually building ecpg target before running tests but I think the
former approach is more suitable for the CI.

CI run after this change applied: https://cirrus-ci.com/build/6264369203380224

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern
meson, try doing e.g.

rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
/messages/by-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2@44hxxvtwmgum

I think the only reason your patch on its own suffices, is that the "all"
target, that we ran separately beforehand, actually has sufficient
dependencies to make things work.

The nice thing is that with this meson improvement we should be able to get
rid of the "setup" test suite and instead generate the test install via
dependencies. Obviously we either have to wait a fair bit or do it depending
on the meson version...

Greetings,

Andres Freund

#5 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#4)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:

On 2025-01-29 18:24:45 +0300, Nazir Bilal Yavuz wrote:

The cause is that meson fixed a bug [1] in v1.7.0. Before v1.7.0,
meson built all targets even though --no-rebuild was used while
running tests. That is fixed in v1.7.0.

[1] https://mesonbuild.com/Release-notes-for-1-7-0.html#test-targets-no-longer-built-by-default

That's not *quite* right - it wasn't that the targets were built when
--no-rebuild was specified, it's that the default build target (for just
'ninja') built all test dependencies.

The change below fixes the problem:

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..c7a94ff6471 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -17,7 +17,7 @@ env:
   CHECK: check-world PROVE_FLAGS=$PROVE_FLAGS
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
-  MTEST_ARGS: --print-errorlogs --no-rebuild -C build
+  MTEST_ARGS: --print-errorlogs -C build
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance

And I think this is the correct approach. It builds all of the
not-yet-built targets before running the tests. Another solution might
be manually building ecpg target before running tests but I think the
former approach is more suitable for the CI.

CI run after this change applied: https://cirrus-ci.com/build/6264369203380224

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern
meson, try doing e.g.

rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
/messages/by-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2@44hxxvtwmgum

I think the only reason your patch on its own suffices, is that the "all"
target, that we ran separately beforehand, actually has sufficient
dependencies to make things work.

Yes, you are right. I agree that what you said is the correct solution
and that should be the ultimate goal. What I shared could be a
band-aid fix to make the Windows CI task happy until the patches you
shared get committed. Another solution might be to downgrade the meson
version in the Windows images at the CI repository [1]; that would be
better for the commit history.

[1]: https://github.com/anarazel/pg-vm-images

--
Regards,
Nazir Bilal Yavuz
Microsoft

#6 Andres Freund
andres@anarazel.de
In reply to: Nazir Bilal Yavuz (#5)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-01-30 16:18:54 +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern
meson, try doing e.g.

rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
/messages/by-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2@44hxxvtwmgum

I think the only reason your patch on its own suffices, is that the "all"
target, that we ran separately beforehand, actually has sufficient
dependencies to make things work.

Yes, you are right. I agree that what you said is the correct solution
and that should be the ultimate goal.

I think we need to fix this properly across branches. This version of meson is
going to become more common soon. And it'll be a problem for developers too, not
just CI. I'll start working on committing these fixes across the branches,
unless somebody protests immediately.

What I shared could be a band-aid fix to make the Windows CI task happy
until the patches you shared get committed.

I think we'll still need something like what you propose. Although I do think
it'd be better if we continued building all targets in a dedicated _script:
block, so that you can see all build failures in those steps.

Greetings,

Andres Freund

#7 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-02-04 12:46:42 -0500, Andres Freund wrote:

On 2025-01-30 16:18:54 +0300, Nazir Bilal Yavuz wrote:

On Wed, 29 Jan 2025 at 19:50, Andres Freund <andres@anarazel.de> wrote:

I don't think that's the entirety of the issue.

Our dependencies aren't quite airtight enough. With a sufficiently modern
meson, try doing e.g.

rm -rf tmp_install/ && ninja clean && meson test --suite setup --suite ecpg

It'll fail, because the dependencies of the tests are insufficient.

See the set of patches at
/messages/by-id/qh4c5tvkgjef7jikjig56rclbcdrrotngnwpycukd2n3k25zi2@44hxxvtwmgum

I think the only reason your patch on its own suffices, is that the "all"
target, that we ran separately beforehand, actually has sufficient
dependencies to make things work.

Yes, you are right. I agree that what you said is the correct solution
and that should be the ultimate goal.

I think we need to fix this properly across branches. This version of meson is
going to become more common soon. And it'll be a problem for developers too, not
just CI. I'll start working on committing these fixes across the branches,
unless somebody protests immediately.

What I shared could be a band-aid fix to make the Windows CI task happy
until the patches you shared get committed.

I think we'll still need something like what you propose. Although I do think
it'd be better if we continued building all targets in a dedicated _script:
block, so that you can see all build failures in those steps.

Pushed like that.

I'll watch CI and BF over the next hours.

Greetings,

Andres

#8 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#7)
Re: Windows CFBot is broken because ecpg dec_test.c error

On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:

Pushed like that.

I'll watch CI and BF over the next hours.

I guess you probably noticed, but in case you didn't: CI on windows is
still broken.

#9 Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#8)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-02-05 19:42:05 +0100, Jelte Fennema-Nio wrote:

On Wed, 5 Feb 2025 at 00:22, Andres Freund <andres@anarazel.de> wrote:

Pushed like that.

I'll watch CI and BF over the next hours.

I guess you probably noticed, but in case you didn't: CI on windows is
still broken.

Huh. CI did pass on all platforms after my push:
https://cirrus-ci.com/github/postgres/postgres/

While there is a failure on master, it isn't due to this:
https://cirrus-ci.com/task/6185223693533184
[17:55:32.636] ------------------------------------- 8< -------------------------------------
[17:55:32.636] stderr:
[17:55:32.636] # Failed test 'can't connect to invalid database - error message'
[17:55:32.636] # at C:/cirrus/src/test/recovery/t/037_invalid_database.pl line 40.
[17:55:32.636] # 'psql: error: connection to server on socket "C:/Windows/TEMP/kqIhcyR2yC/.s.PGSQL.31868" failed: server closed the connection unexpectedly
[17:55:32.636] # This probably means the server terminated abnormally
[17:55:32.636] # before or while processing the request.'
[17:55:32.636] # doesn't match '(?^:FATAL:\s+cannot connect to invalid database "regression_invalid")'
[17:55:32.636] # Looks like you failed 1 test of 10.
[17:55:32.636]
[17:55:32.636] (test program exited with status code 1)
[17:55:32.636] ------------------------------------------------------------------------------

I think that may be due to

commit a14707da564e8c94bd123f0e3a75e194fd7ef56a (upstream/master)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2025-02-05 12:45:58 -0500

Show more-intuitive titles for psql commands \dt, \di, etc.

I do see a lot of failures on cfbot - but afaict that's because for some
reason there haven't been recent runs. Thomas?

E.g. the currently newest run is https://cirrus-ci.com/build/6378368658046976
which is based on

commit 43493cceda2
Author: Peter Eisentraut <peter@eisentraut.org>
Date: 2025-01-24 22:58:13 +0100

Add get_opfamily_name() function

Greetings,

Andres Freund

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jelte Fennema-Nio (#8)
Re: Windows CFBot is broken because ecpg dec_test.c error

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

I guess you probably noticed, but in case you didn't: CI on windows is
still broken.

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

regards, tom lane

#11 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:

Jelte Fennema-Nio <postgres@jeltef.nl> writes:

I guess you probably noticed, but in case you didn't: CI on windows is
still broken.

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

It passed on the postgres repo just before this commit:
https://cirrus-ci.com/build/4733656549294080
and then failed with it:
https://cirrus-ci.com/build/5944955807465472

Greetings,

Andres Freund

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: Windows CFBot is broken because ecpg dec_test.c error

Andres Freund <andres@anarazel.de> writes:

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

It passed on the postgres repo just before this commit:
https://cirrus-ci.com/build/4733656549294080
and then failed with it:
https://cirrus-ci.com/build/5944955807465472

Hmm, maybe it's only the cfbot's web server that's broken,
but none of the pages at http://cfbot.cputube.org
appear to be updating. What other mechanism are you using
to find the cirrus-ci.com logs?

regards, tom lane

#13 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Tom Lane (#12)
Re: Windows CFBot is broken because ecpg dec_test.c error

On Wed, 5 Feb 2025 at 20:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

It passed on the postgres repo just before this commit:
https://cirrus-ci.com/build/4733656549294080
and then failed with it:
https://cirrus-ci.com/build/5944955807465472

Hmm, maybe it's only the cfbot's web server that's broken,
but none of the pages at http://cfbot.cputube.org
appear to be updating. What other mechanism are you using
to find the cirrus-ci.com logs?

Ugh yes, cfbot isn't updating at all anymore. So Andres' commits might
very well have fixed the issue, but the prod cfbot is not doing any
builds at the moment...

I'll look into fixing that soonish. I took a quick look and it seems
related to some unexpected response from the Cirrus API.

#14 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: Windows CFBot is broken because ecpg dec_test.c error

Hi,

On 2025-02-05 14:20:59 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2025-02-05 14:09:02 -0500, Tom Lane wrote:

Hard to tell, considering the cfbot has been completely wedged
since Sunday.

It passed on the postgres repo just before this commit:
https://cirrus-ci.com/build/4733656549294080
and then failed with it:
https://cirrus-ci.com/build/5944955807465472

Hmm, maybe it's only the cfbot's web server that's broken,
but none of the pages at http://cfbot.cputube.org
appear to be updating.

It does look to me like cfbot isn't updating the relevant branches, i.e. it's
not just the website that's not updating, or CI somehow not triggering after
cfbot updates the relevant branches.

What other mechanism are you using to find the cirrus-ci.com logs?

This isn't run via cfbot, but via postgres' github mirror. Whenever the repo
sync pushes a change it also triggers CI.

You can see all the runs of that on
https://cirrus-ci.com/github/postgres/postgres/

CI on Windows failed in ecpg for a few days; only two master runs passed
between that being fixed and the failure I linked to above. But
recovery/037_invalid_database didn't fail at that time.

Greetings,

Andres Freund

#15 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#13)
Re: Windows CFBot is broken because ecpg dec_test.c error

On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

I'll look into fixing that soonish. I took a quick look and it seems
related to some unexpected response from the Cirrus API.

Okay I think I got it running again. It didn't like that there was no
commitfest with number 54 yet. So I created one, and it's doing more
than before now. I'll check after dinner if it's still running
correctly then.

#16 Daniel Gustafsson
daniel@yesql.se
In reply to: Jelte Fennema-Nio (#15)
Re: Windows CFBot is broken because ecpg dec_test.c error

On 5 Feb 2025, at 20:36, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Wed, 5 Feb 2025 at 20:29, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

I'll look into fixing that soonish. I took a quick look and it seems
related to some unexpected response from the Cirrus API.

Okay I think I got it running again. It didn't like that there was no
commitfest with number 54 yet. So I created one, and it's doing more
than before now. I'll check after dinner if it's still running
correctly then.

For reference, you meant 53, right? (There is no 54 in the system.) If the
CFBot always needs one in "Future" state we should document that to make sure we
don't miss that going forward (and perhaps automate it to make sure we don't
make manual work for ourselves).

--
Daniel Gustafsson

#17 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Daniel Gustafsson (#16)
Re: Windows CFBot is broken because ecpg dec_test.c error

On Wed, 5 Feb 2025 at 21:05, Daniel Gustafsson <daniel@yesql.se> wrote:

For reference, you meant 53 right?

Yes, I meant 53

If the
CFBot always needs one in "Future" state we should document that to make sure we
don't miss that going forward (and perhaps automate it to make sure we don't
make manual work for ourselves).

Afaict it doesn't need at least one in the "Future" state; instead it
needs one after the current [1] commitfest. I don't think it should
rely on that though, so I created an issue to fix that [2].
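
As a rough illustration of that failure mode (this is not cfbot's actual
code; next_commitfest_id() and the id lists are hypothetical), a lookup for
"the commitfest after the current one" comes back empty when that commitfest
has not been created yet, and the caller has to cope with that:

```python
from typing import Iterable, Optional


def next_commitfest_id(current_id: int, known_ids: Iterable[int]) -> Optional[int]:
    """Return the smallest known id greater than current_id, or None."""
    later = [i for i in known_ids if i > current_id]
    return min(later) if later else None


# Assumed ids for illustration: 52 is current and 53 does not exist yet.
print(next_commitfest_id(52, [50, 51, 52]))      # None
print(next_commitfest_id(52, [50, 51, 52, 53]))  # 53
```

Code that assumes the result is never None would break in the first case,
which matches the symptom of things working again once 53 was created.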

It does seem silly that we require people to manually create new
commitfests though, so I created an issue to track that [3].

[1]: https://commitfest.postgresql.org/current/
[2]: https://github.com/macdice/cfbot/issues/22
[3]: https://github.com/postgres/pgcommitfest/issues/25