Random pg_upgrade 004_subscription test failure on drongo

Started by vignesh Cover 1 year ago11 messageshackers
Jump to latest
#1vignesh C
vignesh21@gmail.com

Hi,

In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35.
It has failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.
I checked old buildfarm failures and found a similar failure in drongo
for pg_upgarde/003_logical_slots at [2]/messages/by-id/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A@TYAPR01MB5866.jpnprd01.prod.outlook.com.

The test failure can happen because of the reason mentioned at [3]/messages/by-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA@TYAPR01MB5866.jpnprd01.prod.outlook.com and [6]/messages/by-id/f0d303f1-e380-5988-91c7-74b755abd4bb@gmail.com.
I'm just requoting the contents from there, if you have checked [3]/messages/by-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA@TYAPR01MB5866.jpnprd01.prod.outlook.com
and [6]/messages/by-id/f0d303f1-e380-5988-91c7-74b755abd4bb@gmail.com skip the below contents:
---------------------------------------------
The failure reason for the earlier failure was found to be:
According to it, the TRUNCATE command seemed to be failed due to a
"File exists" error.
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...

## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [4]/messages/by-id/CA+hUKGKsdzw06c5nnb=KYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q@mail.gmail.com pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [5]https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------

This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch in similar lines for 004_subscription.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2025-03-10%2019%3A26%3A35
[2]: /messages/by-id/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A@TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: /messages/by-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA@TYAPR01MB5866.jpnprd01.prod.outlook.com
[4]: /messages/by-id/CA+hUKGKsdzw06c5nnb=KYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q@mail.gmail.com
[5]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[6]: /messages/by-id/f0d303f1-e380-5988-91c7-74b755abd4bb@gmail.com

Regards,
Vignesh

Attachments:

0001-Fix-an-intermetant-BF-failure-in-004_subscription.patchapplication/octet-stream; name=0001-Fix-an-intermetant-BF-failure-in-004_subscription.patchDownload+13-1
#2Andrew Dunstan
andrew@dunslane.net
In reply to: vignesh C (#1)
Re: Random pg_upgrade 004_subscription test failure on drongo

On 2025-03-13 Th 5:04 AM, vignesh C wrote:

Hi,

In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1].
It has failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.

That's not supposed to happen. I am testing a fix to see if I can make
it collect the logs, but for now we'll have to wait till the next failure ..

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: vignesh C (#1)
Re: Random pg_upgrade 004_subscription test failure on drongo

On 13/03/2025 11:04, vignesh C wrote:

## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [4] pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------

This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch in similar lines for 004_subscription.

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly. In dropdb() and
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

--
Heikki Linnakangas
Neon (https://neon.tech)

#4vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Random pg_upgrade 004_subscription test failure on drongo

On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly.

Yes, it can happen with these scenarios.

In dropdb() and DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

I felt mdcreate would be a good place to fix this issue.

Regards,
Vignesh

#5vignesh C
vignesh21@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Random pg_upgrade 004_subscription test failure on drongo

On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly. In dropdb() and
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

How about a patch similar to the attached one? I have run pg_upgrade
tests multiple times, but unfortunately, I was unable to reproduce the
issue or verify these changes.

Regards,
Vignesh

Attachments:

v1-0001-Fix-issue-with-file-handle-retention-during-CREAT.patchapplication/octet-stream; name=v1-0001-Fix-issue-with-file-handle-retention-during-CREAT.patchDownload+16-1
#6vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#5)
Re: Random pg_upgrade 004_subscription test failure on drongo

On Fri, 21 Mar 2025 at 18:54, vignesh C <vignesh21@gmail.com> wrote:

On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, this problem isn't limited to this one pg_upgrade test, right? It
could happen with any pg_upgrade invocation. And perhaps in a running
server too, if a relfilenumber is reused quickly. In dropdb() and
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

How about a patch similar to the attached one? I have run pg_upgrade
tests multiple times, but unfortunately, I was unable to reproduce the
issue or verify these changes.

CFBot reported an issue in one of the machines, here is an updated
version for the same.

Regards,
Vignesh

Attachments:

v2-0001-Fix-issue-with-file-handle-retention-during-CREAT.patchapplication/octet-stream; name=v2-0001-Fix-issue-with-file-handle-retention-during-CREAT.patchDownload+25-1
#7Michael Paquier
michael@paquier.xyz
In reply to: vignesh C (#6)
Re: Random pg_upgrade 004_subscription test failure on drongo

On Mon, Sep 22, 2025 at 02:28:35PM +0530, vignesh C wrote:

CFBot reported an issue in one of the machines, here is an updated
version for the same.

@@ -235,6 +248,9 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
[...]
+#if defined(WIN32) && !defined(__CYGWIN__)
+            if (!retryattempted && pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
+            {
+                retryattempted = true;
+                WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
+                goto retry;
+            }

Adding more WIN32-specific awesomeness into a single backend code path
that we try to make POSIX-consistent does not seem right to me,
because it may apply to more FDs opened than this one, no? One code
path would be to enforce a signal in pgwin32_open_handle(), only when
we see a STATUS_DELETE_PENDING. And there is a retry loop in
src/port/open.c in our wrapper, partially for this reason.

Like any failures of this type, how can we reliably make sure that
these issues are gone for sure? Perhaps it would be time to have a
test module specified in concurrent file-system operations? We could
hold into FDs while making backends wait, for example, with various
concurrent in-core calls stepping on each other.. Perhaps that would
be beneficial in the long-term knowing the number of platforms we need
to support.
--
Michael

#8Alexander Lakhin
exclusion@gmail.com
In reply to: Andrew Dunstan (#2)
Re: Random pg_upgrade 004_subscription test failure on drongo

Hello Andrew,

13.03.2025 14:19, Andrew Dunstan wrote:

On 2025-03-13 Th 5:04 AM, vignesh C wrote:

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.

That's not supposed to happen. I am testing a fix to see if I can make it collect the logs, but for now we'll have to
wait till the next failure ..

Windows animals still produce mysterious pg_upgrade/004_subscription
failures: [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-05-03%2014%3A03%3A12 and [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2026-01-12%2000%3A09%3A37 (and 006_transfer_modes also failed similarly since
then: [3]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2025-12-28%2003%3A43%3A24 and [4]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-01-14%2010%3A52%3A35). Unluckily, pg_upgrade_output.d/ content is still not
shown in the failure logs, so we can only guess why the tests failed. I
could not reproduce something alike locally, unfortunately — tried
different approaches for several days, but without success, so if we could
see the error/failure reason in the log it would be useful, I hope.

Regarding 006_transfer_modes, I thought the reason we have no
pg_upgrade_output.d/ is that the test removes it explicitly: [5]/messages/by-id/61d3ce85-9c7b-4eea-bf83-81272cab00c3@gmail.com, but I
see no clean_node in 004_subscription, so maybe the log collection is yet
to be fixed in the buildfarm client.

Could you have a look, please?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-05-03%2014%3A03%3A12
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2026-01-12%2000%3A09%3A37
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2025-12-28%2003%3A43%3A24
[4]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-01-14%2010%3A52%3A35
[5]: /messages/by-id/61d3ce85-9c7b-4eea-bf83-81272cab00c3@gmail.com

Best regards,
Alexander

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Alexander Lakhin (#8)
Re: Random pg_upgrade 004_subscription test failure on drongo

On 2026-05-10 Su 11:00 AM, Alexander Lakhin wrote:

Hello Andrew,

13.03.2025 14:19, Andrew Dunstan wrote:

On 2025-03-13 Th 5:04 AM, vignesh C wrote:

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.

That's not supposed to happen. I am testing a fix to see if I can
make it collect the logs, but for now we'll have to wait till the
next failure ..

Windows animals still produce mysterious pg_upgrade/004_subscription
failures: [1] and [2] (and 006_transfer_modes also failed similarly since
then: [3] and [4]). Unluckily, pg_upgrade_output.d/ content is still not
shown in the failure logs, so we can only guess why the tests failed. I
could not reproduce something alike locally, unfortunately — tried
different approaches for several days, but without success, so if we
could
see the error/failure reason in the log it would be useful, I hope.

Regarding 006_transfer_modes, I thought the reason we have no
pg_upgrade_output.d/ is that the test removes it explicitly: [5], but I
see no clean_node in 004_subscription, so maybe the log collection is yet
to be fixed in the buildfarm client.

Could you have a look, please?

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-05-03%2014%3A03%3A12
[2]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2026-01-12%2000%3A09%3A37
[3]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2025-12-28%2003%3A43%3A24
[4]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-01-14%2010%3A52%3A35
[5]
/messages/by-id/61d3ce85-9c7b-4eea-bf83-81272cab00c3@gmail.com

The test looked ok as it was on drongo and fairywren, but I have
loosened it some more on those to see if we can catch the output logs.
There are other issues with the 004 test at least (The ok test for the
output directory is wrong if there's a failure). But that can wait for now.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#9)
Re: Random pg_upgrade 004_subscription test failure on drongo

On 2026-05-11 Mo 8:10 AM, Andrew Dunstan wrote:

On 2026-05-10 Su 11:00 AM, Alexander Lakhin wrote:

Hello Andrew,

13.03.2025 14:19, Andrew Dunstan wrote:

On 2025-03-13 Th 5:04 AM, vignesh C wrote:

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.

That's not supposed to happen. I am testing a fix to see if I can
make it collect the logs, but for now we'll have to wait till the
next failure ..

Windows animals still produce mysterious pg_upgrade/004_subscription
failures: [1] and [2] (and 006_transfer_modes also failed similarly
since
then: [3] and [4]). Unluckily, pg_upgrade_output.d/ content is still not
shown in the failure logs, so we can only guess why the tests failed. I
could not reproduce something alike locally, unfortunately — tried
different approaches for several days, but without success, so if we
could
see the error/failure reason in the log it would be useful, I hope.

Regarding 006_transfer_modes, I thought the reason we have no
pg_upgrade_output.d/ is that the test removes it explicitly: [5], but I
see no clean_node in 004_subscription, so maybe the log collection is
yet
to be fixed in the buildfarm client.

Could you have a look, please?

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-05-03%2014%3A03%3A12
[2]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2026-01-12%2000%3A09%3A37
[3]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2025-12-28%2003%3A43%3A24
[4]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-01-14%2010%3A52%3A35
[5]
/messages/by-id/61d3ce85-9c7b-4eea-bf83-81272cab00c3@gmail.com

The test looked ok as it was on drongo and fairywren, but I have
loosened it some more on those to see if we can catch the output logs.
There are other issues with the 004 test at least (The ok test for the
output directory is wrong if there's a failure). But that can wait for
now.

OK, I have found and fixed the buildfarm bug. See
<https://github.com/PGBuildFarm/client-code/commit/55fdf7e0aa355600512a26cdd159b974498f802e&gt;

It will be in the next release, but meanwhile I have deployed it on
drongo and fairywren.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#11Alexander Lakhin
exclusion@gmail.com
In reply to: Andrew Dunstan (#10)
Re: Random pg_upgrade 004_subscription test failure on drongo

Hello Andrew,

16.05.2026 14:22, Andrew Dunstan wrote:

OK, I have found and fixed the buildfarm bug. See
<https://github.com/PGBuildFarm/client-code/commit/55fdf7e0aa355600512a26cdd159b974498f802e&gt;

It will be in the next release, but meanwhile I have deployed it on drongo and fairywren.

Fairywren has failed 005_char_signedness.pl yesterday [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-06-22%2002%3A03%3A10:
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/tools/xmsys64/home/pgrunner/bf/root/REL_18_STABLE/pgsql.build/testrun/pg_upgrade/005_char_signedness/data/t_005_char_signedness_new_data/pgdata/pg_upgrade_output.d/20260622T020818.960/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting
[02:08:41.260](23.954s) not ok 13 - run of pg_upgrade

But still no pg_upgrade logs to find out what caused the failure,
unfortunately.

Could you have a look once more, please?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&amp;dt=2026-06-22%2002%3A03%3A10

Best regards,
Alexander