Random pg_upgrade test failure on drongo

Started by Hayato Kuroda (Fujitsu), over 2 years ago, 31 messages, hackers
#1 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com

Dear hackers,

While tracking the buildfarm, I found that drongo failed the test pg_upgrade/003_logical_slots [1].
A strange point is that the test passed on the next iteration. I'm not yet sure of the
reason, but I will keep an eye on it and investigate if it happens again.

I think this failure is not related to our logical slots work, even though it
failed in 003_logical_slots.pl. For more detail, please see the latter part.

For further investigation, a server log taken during the upgrade may be needed. It is
written into the data directory, so the BF system does not upload it. I may need that
additional information if the test fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects
into the new cluster [2].

At the code level, the pg_upgrade command failed in exec_prog().
In that function, pg_restore was being executed for the database "postgres".
Below is a brief call stack. Note that pg_restore is not used for migrating
logical replication slots; that is done by the pg_upgrade binary itself. Also, the
migration is done after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since -j option is not specified, it is just a wrapper
create_new_objects()
main()
```

In exec_prog(), the system() call was made but returned a non-zero value.
The documentation says that system() returns the value returned by the command
interpreter when the input is not NULL [3]. Unfortunately, the current code does not
output the return code, and the BF system does not upload the data directory for
failed tests. Therefore, I could not get more information for the investigation.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check
[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 127.0.0.1 -p 54813 -P 54814 --copy
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster ok
Restoring global objects in the new cluster ok
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log" for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) # Failed test 'run of pg_upgrade of old cluster'
# at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o --cluster-name=newpub start
waiting for server to start.... done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) # Failed test 'check the slot exists on new cluster'
# at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176.
[14:23:28.399](0.000s) # got: ''
# expected: 'regress_sub|t'
...
```
[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#2 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
RE: Random pg_upgrade test failure on drongo

Dear hackers,

While tracking a buildfarm, I found that drongo failed the test
pg_upgrade/003_logical_slots [1].
A strange point is that the test passed in the next iteration. Currently I'm not
sure the reason, but I will keep my eye for it and will investigate if it
happens again.

This email is just an update. We found that fairywren also failed for
the same reason [2]. It fails inconsistently, but there might be something bad on
Windows. I'm now trying to reproduce it with my colleagues to analyze it in more detail.
I'm also working with Andrew to get the logs emitted during the upgrade.
I will continue to keep an eye on it.

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-08%2010%3A22%3A45

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#3 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#2)
RE: Random pg_upgrade test failure on drongo

Dear hackers,

This email is an update. The machine drongo failed the test a week ago [1],
and I finally got the log files. PSA files.

## Observed failure

pg_upgrade_server.log is a server log taken during the pg_upgrade command. According to
it, a TRUNCATE command failed due to a "File exists" error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR: could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think this was caused by the STATUS_DELETE_PENDING failure, and is not related to the
recent updates for pg_upgrade.

The file "base/1/2683" is the index file for pg_largeobject_loid_pn_index, and the
output means that creating that file failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink any file which
has the same name. Below is the call trace down to the unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said [2] that pgunlink() sometimes cannot actually remove a file even
though it returns OK; at that time the NTSTATUS is STATUS_DELETE_PENDING. Also, a comment
in pgwin32_open_handle() mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, indeed,
open() can fail with STATUS_DELETE_PENDING if the deletion of the file is still
pending when the file is opened again.

Another thread [4] also touches this issue, in the rmtree()->unlink() path, and it retries
the removal if it fails with STATUS_DELETE_PENDING. So, should we retry the open() as well
when it fails in the same way? Anyway, such a fix seems out of scope for pg_upgrade.

What do you think? Do you have any thoughts about it?

## Acknowledgement

I want to say thanks to Sholk and Vingesh for helping with the analysis.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: /messages/by-id/CA+hUKGKsdzw06c5nnb=KYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q@mail.gmail.com
[3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]: /messages/by-id/20220919213217.ptqfdlcc5idk5xup@awork3.anarazel.de

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#4 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#3)
RE: Random pg_upgrade test failure on drongo

Dear hackers,

This email tells an update. The machine drongo failed the test a week ago [1]
and finally got logfiles. PSA files.

Oh, sorry. I forgot to attach the files. You can see pg_upgrade_server.log for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

pg_upgrade_server.log (application/octet-stream)
#5 Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#4)
Re: Random pg_upgrade test failure on drongo

Hello Kuroda-san,

21.11.2023 13:37, Hayato Kuroda (Fujitsu) wrote:

This email tells an update. The machine drongo failed the test a week ago [1]
and finally got logfiles. PSA files.

Oh, sorry. I missed to attach files. You can see pg_upgrade_server.log for now.

I can easily reproduce this failure on my workstation by running 5
003_logical_slots tests in parallel inside a Windows VM with its CPU resources
limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in range(1,NUMTESTS+1):
tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR            60.30s   exit status 25
...
pg_restore: error: could not execute query: ERROR:  could not create file "base/1/2683": File exists
...

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Best regards,
Alexander

Attachments:

win-unlink-through-rename.patch (text/x-patch; +20 -5)
#6 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#5)
RE: Random pg_upgrade test failure on drongo

Dear Alexander,

I can easily reproduce this failure on my workstation by running 5 tests
003_logical_slots in parallel inside Windows VM with it's CPU resources
limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in
range(1,NUMTESTS+1):
tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in
range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots
ERROR 60.30s exit status 25
...
pg_restore: error: could not execute query: ERROR: could not create file
"base/1/2683": File exists
...

Great. I do not have such an environment, so I could not reproduce it myself. This seems
to suggest that the failure occurs when the system is busy.

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Thanks; here are some comments. I'm quite unsure about Windows, so I may say
something wrong.

* I'm not sure why the file/directory name is changed before doing the unlink.
Could you add a description?
* IIUC, the important point is the latter part, which waits until the status has
changed. Based on that, can we remove the double rmtree() from cleanup_output_dirs()?
It seems to have been added for a similar reason.

```
+	loops = 0;
+	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+	{
+		if (++loops > 100)		/* time out after 10 sec */
+			return -1;
+		pg_usleep(100000);		/* us */
+	}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#7 Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#6)
Re: Random pg_upgrade test failure on drongo

Hello Kuroda-san,

23.11.2023 15:15, Hayato Kuroda (Fujitsu) wrote:

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Thanks, here are comments. I'm quite not sure for the windows, so I may say
something wrong.

* I'm not sure why the file/directory name was changed before doing a unlink.
Could you add descriptions?

Please look at the simple test program attached. It demonstrates the
failure for me when running in two sessions as follows:
unlink-open test 150 1000
                unlink-open test2 150 1000
...
iteration 60
iteration 61
fopen() after unlink() failed (13)

Process Monitor shows:
...
        9:16:55.9249792 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 138,412,032,
Length: 1,048,576
### unlink() performed for the file "test":
9:16:55.9852903 AM    unlink-open.exe    4968    CreateFile C:\src\test    SUCCESS    Desired Access: Read Attributes,
Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write,
Delete, AllocationSize: n/a, OpenResult: Opened
9:16:55.9853637 AM    unlink-open.exe    4968 QueryAttributeTagFile    C:\src\test    SUCCESS    Attributes: A,
ReparseTag: 0x0
### file "test" gets into state DELETE PENDING:
9:16:55.9853756 AM    unlink-open.exe    4968 SetDispositionInformationFile    C:\src\test    SUCCESS    Delete: True
9:16:55.9853888 AM    unlink-open.exe    4968    CloseFile C:\src\test    SUCCESS
### concurrent operations with file "test2":
        9:16:55.9866362 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 139,460,608,
Length: 1,048,576
...
        9:16:55.9972373 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 157,286,400,
Length: 1,048,576
        9:16:55.9979040 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### unlink() for "test2":
        9:16:56.1029981 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode:
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
        9:16:56.1030432 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes:
A, ReparseTag: 0x0
### file "test2" gets into state DELETE PENDING:
        9:16:56.1030517 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS   
Delete: True
        9:16:56.1030625 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### and then it opened successfully:
        9:16:56.1189503 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
        9:16:56.1192016 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
### operations with file "test2" continued:
        9:16:56.1193394 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode:
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
        9:16:56.1193895 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes:
A, ReparseTag: 0x0
        9:16:56.1194042 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS   
Delete: True
        9:16:56.1194188 AM    unlink-open.exe    3232 CloseFile    C:\src\test2    SUCCESS
        9:16:56.1198459 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
        9:16:56.1200302 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 0, Length:
1,048,576, Priority: Normal
...
        9:16:56.1275871 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 10,485,760,
Length: 1,048,576

### at the same time, CreateFile() for file "test" failed:
9:16:56.1276453 AM    unlink-open.exe    4968    CreateFile C:\src\test    DELETE PENDING    Desired Access: Generic
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,
ShareMode: Read, Write, AllocationSize: 0
9:16:56.1279359 AM    unlink-open.exe    3232    WriteFile C:\src\test2    SUCCESS    Offset: 11,534,336, Length: 1,048,576
9:16:56.1283452 AM    unlink-open.exe    3232    WriteFile C:\src\test2    SUCCESS    Offset: 12,582,912, Length: 1,048,576
...

But with rename(MoveFileEx), I see:
unlink-open test 150 1000 rename
...
9:38:01.7035286 AM    unlink-open.exe    10208    WriteFile C:\src\test    SUCCESS    Offset: 156,237,824, Length: 1,048,576
9:38:01.7075621 AM    unlink-open.exe    10208    WriteFile C:\src\test    SUCCESS    Offset: 157,286,400, Length: 1,048,576
9:38:01.7101299 AM    unlink-open.exe    10208    CloseFile C:\src\test    SUCCESS
9:38:01.7130802 AM    unlink-open.exe    10208    CreateFile C:\src\test    SUCCESS    Desired Access: Read Attributes,
Delete, Synchronize, Disposition: Open, Options: Synchronous IO Non-Alert, Open Reparse Point, Attributes: n/a,
ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.7132844 AM    unlink-open.exe    10208 QueryAttributeTagFile    C:\src\test    SUCCESS    Attributes: A,
ReparseTag: 0x0
9:38:01.7133420 AM    unlink-open.exe    10208 QueryBasicInformationFile    C:\src\test    SUCCESS    CreationTime:
11/24/2023 9:38:01 AM, LastAccessTime: 11/24/2023 9:38:01 AM, LastWriteTime: 11/24/2023 9:38:01 AM, ChangeTime:
11/24/2023 9:38:01 AM, FileAttributes: A
9:38:01.7135191 AM    unlink-open.exe    10208    CreateFile C:\src    SUCCESS    Desired Access: Write Data/Add File,
Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened

### file "test" renamed to "test.tmp", it doesn't get into state DELETE PENDING
9:38:01.7136221 AM    unlink-open.exe    10208 SetRenameInformationFile    C:\src\test    SUCCESS ReplaceIfExists: True,
FileName: C:\src\test.tmp
9:38:01.8384110 AM    unlink-open.exe    10208    CloseFile C:\src    SUCCESS
9:38:01.8388203 AM    unlink-open.exe    10208    CloseFile C:\src\test.tmp    SUCCESS

### then file "test.tmp" deleted as usual:
9:38:01.8394278 AM    unlink-open.exe    10208    CreateFile C:\src\test.tmp    SUCCESS    Desired Access: Read
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode:
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
9:38:01.8396534 AM    unlink-open.exe    10208 QueryAttributeTagFile    C:\src\test.tmp    SUCCESS    Attributes: A,
ReparseTag: 0x0
9:38:01.8396885 AM    unlink-open.exe    10208 SetDispositionInformationFile    C:\src\test.tmp    SUCCESS Delete: True
9:38:01.8397312 AM    unlink-open.exe    10208    CloseFile C:\src\test.tmp    SUCCESS
9:38:01.9162566 AM    unlink-open.exe    10208    CreateFile C:\src\test    SUCCESS    Desired Access: Generic Write,
Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N,
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
9:38:01.9167628 AM    unlink-open.exe    10208    CloseFile C:\src\test    SUCCESS

So the same test run with MoveFileEx():
unlink-open test 150 1000 rename
                unlink-open test2 150 1000 rename
successfully passes for me in the same environment (Windows VM slowed down to 50%).

That is, my idea was to try removing a file by renaming it first, as a fast
path (thus avoiding that troublesome DELETE PENDING state), and if that
fails, to perform the removal as before. Maybe the whole function could be
simplified, but I'm not sure about special cases yet.

* IIUC, the important points is the latter part, which waits until the status is
changed. Based on that, can we remove a double rmtree() from cleanup_output_dirs()?
They seems to be add for the similar motivation.

I haven't yet reproduced the failure which motivated that doubling (IIUC, it
was observed in [1]) with c28911750 reverted, so I need more time to
research that issue before answering this question.

[1]: /messages/by-id/20230131172806.GM22427@telsasoft.com

Best regards,
Alexander

Attachments:

unlink-open.c (text/x-csrc)
#8 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#7)
RE: Random pg_upgrade test failure on drongo

Dear Alexander,

Please look at the simple test program attached. It demonstrates the
failure for me when running in two sessions as follows:
unlink-open test 150 1000
unlink-open test2 150 1000

Thanks for attaching the program. It helps us understand the issue.
I wanted to confirm your environment: this failure occurred on Windows Server XXXX, right?

That is, my idea was to try removing a file through renaming it as a fast
path (thus avoiding that troublesome state DELETE PENDING), and if that
fails, to perform removal as before. May be the whole function might be
simplified, but I'm not sure about special cases yet.

I feel that your results show pgrename() is delayed more rarely than unlink().
That's why a file with the original name no longer exists by the time the subsequent open() is called.

As for the special cases, I'd like more senior developers to check them.

* IIUC, the important points is the latter part, which waits until the status is
changed. Based on that, can we remove a double rmtree() from

cleanup_output_dirs()?

They seems to be add for the similar motivation.

I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
was observed in [1]), with c28911750 reverted, so I need more time to
research that issue to answer this question.

Yeah, in the first place, this failure seldom occurred....

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#9 Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: Random pg_upgrade test failure on drongo

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure was occurred on windows server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure depends not on the Windows edition, but
rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, and for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It would also be interesting to know the full OS version/build on drongo and
fairywren.

That is, my idea was to try removing a file through renaming it as a fast
path (thus avoiding that troublesome state DELETE PENDING), and if that
fails, to perform removal as before. May be the whole function might be
simplified, but I'm not sure about special cases yet.

I felt that your result showed pgrename() would be more rarely delayed than unlink().
That's why a file which has original name would not exist when subsequent open() was called.

I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.

* IIUC, the important points is the latter part, which waits until the status is
changed. Based on that, can we remove a double rmtree() from

cleanup_output_dirs()?

They seems to be add for the similar motivation.

I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
was observed in [1]), with c28911750 reverted, so I need more time to
research that issue to answer this question.

Yeah, as the first place, this failure seldom occurred....

I've managed to reproduce that issue (or at least a situation that
manifested similarly) with a sleep added in miscinit.c:
        ereport(IsPostmasterEnvironment ? LOG : NOTICE,
                        (errmsg("database system is shut down")));
+       pg_usleep(500000L);

With this change, I get the same warning as in [1] when running
10 002_pg_upgrade tests in parallel with a minimal olddump (on iterations
33, 46, 8). And with my PoC patch applied, I could see the same warning
as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an
open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be
fixed in accordance with, or as a supplement to, [2].

[1]: /messages/by-id/20230131172806.GM22427@telsasoft.com
[2]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

Best regards,
Alexander

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Alexander Lakhin (#9)
Re: Random pg_upgrade test failure on drongo

On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure was occurred on windows
server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition, but
rather on it's build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting, what is full version/build of OS on drongo and
fairywren.

It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#11 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#10)
Re: Random pg_upgrade test failure on drongo

On 2023-11-27 Mo 07:39, Andrew Dunstan wrote:

On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure was occurred on windows
server XXXX, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition,
but
rather on it's build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to
FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting, what is full version/build of OS on drongo and
fairywren.

It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122

I've updated it to 17763.5122 now.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#12 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#9)
RE: Random pg_upgrade test failure on drongo

Dear Alexander, Andrew,

Thanks for your analysis!

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition, but
rather on it's build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to
FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting what the full version/build of the OS is on drongo and
fairywren.

Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
Your analysis suggests the BF failures can be solved by updating the OSes.

I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.

OK. Actually I could not find any documentation about this, but your experiment demonstrates the facts.

I've managed to reproduce that issue (or at least a situation that
manifested similarly) with a sleep added in miscinit.c:
ereport(IsPostmasterEnvironment ? LOG : NOTICE,
(errmsg("database system is shut down")));
+ pg_usleep(500000L);

With this change, I get the same warning as in [1] when running 10 instances
of the 002_pg_upgrade test in parallel with a minimal olddump (on iterations
33, 46, 8). And with my PoC patch applied, I could see the same warning
as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an
open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be
fixed in accordance with/as a supplement to [2].

OK, I understood that we need more fixes around here. For now, let's focus on
this specific issue.

Your patch seems good, but it needs more review from Windows-savvy developers.
What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#13Alexander Lakhin
exclusion@gmail.com
In reply to: Andrew Dunstan (#11)
Re: Random pg_upgrade test failure on drongo

Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:

It's also interesting what the full version/build of the OS is on drongo and
fairywren.

It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122

I've updated it to 17763.5122 now.

Thank you for the information! It prompted me to upgrade my Server
2019 1809 17763.592 to 17763.4252. And then I discovered that I had
difficulties reproducing the issue on all my VMs after a reboot (even
on old versions/builds). It took me a while to understand what was going on
and what affects reproduction of the issue.
I was puzzled by the fact that I couldn't reproduce the issue with my
unlink-open test program under seemingly the same conditions as before,
until I realized that the issue is reproduced only when the target directory
is opened in Windows Explorer.

Now I'm sorry for bringing more mystery into the topic and for misleading
information.

So, the issue is reproduced only when something scans the working directory
for files or opens them. I added the same logic into my test program (see
unlink-open-scandir attached), and now I see the failure on Windows Server
2019 (Version 10.0.17763.4252).
A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
---------- LOG1
---------- LOG10
fopen() after unlink() failed (13)
---------- LOG2
fopen() after unlink() failed (13)
---------- LOG3
fopen() after unlink() failed (13)
---------- LOG4
fopen() after unlink() failed (13)
---------- LOG5
fopen() after unlink() failed (13)
---------- LOG6
fopen() after unlink() failed (13)
---------- LOG7
fopen() after unlink() failed (13)
---------- LOG8
fopen() after unlink() failed (13)
---------- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same is observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimates were
incorrect (my apologies).

BTW, "rename" mode of the test program can produce more rare errors on
rename:
---------- LOG3
MoveFileEx() failed (0)

but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:

Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
Your analysis suggests the BF failures can be solved by updating the OSes.

Yes, but I don't think that leaving Server 2019 behind is acceptable
(I suppose Windows Server 2019 build 20348 would have the same behaviour
as Windows 10 19045). (Though looking at Cirrus CI logs, I see that what is
entitled "Windows Server 2019" there is in fact Windows Server 2022.)

I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.

OK. Actually I could not find any documentation about this, but your experiment demonstrates the facts.

I don't know what this peculiarity is called, but it looks like when some
other process holds a handle to the file, unlink() returns as if the file
was deleted completely, but a subsequent open() fails.

Best regards,
Alexander

Attachments:

unlink-open-scandir.c (text/x-csrc)
#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#5)
RE: Random pg_upgrade test failure on drongo

Dear Alexander,

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

There are no other reviewers, so I will review it again. Let's turn the PoC
into a concrete patch. Note that I focused only on fixing the random failure;
other parts are out of scope.

Basically, the code comments should be updated accordingly.

01.

```
/*
* This function might be called for a regular file or for a junction
* point (which we use to emulate symlinks). The latter must be unlinked
* with rmdir() on Windows. Before we worry about any of that, let's see
* if we can unlink directly, since that's expected to be the most common
* case.
*/
snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
if (pgrename(path, tmppath) == 0)
{
if (unlink(tmppath) == 0)
return 0;
curpath = tmppath;
}
```

The comment above the change should be updated, since the logic is not trivial.
Below is my draft:

```
* XXX: we rename the target file to ".tmp" before calling unlink, because
* the removal may fail with a STATUS_DELETE_PENDING status on Windows, in
* which case creating a file with the same name would fail. This assumes
* that renaming is a synchronous operation.
```

02.

```
loops = 0;
while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
{
if (++loops > 100) /* time out after 10 sec */
return -1;
pg_usleep(100000); /* us */
}
```

A comment can be added above this part. Below is my draft.

```
/*
* Wait until the removal is really finished, to avoid ERRORs when creating
* a file with the same name in other functions.
*/
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#15Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Random pg_upgrade test failure on drongo

Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:

Dear Alexander,

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

There are no other reviewers, so I will review it again. Let's turn the PoC
into a concrete patch. Note that I focused only on fixing the random failure;
other parts are out of scope.

Thinking about that fix more, I'm not satisfied with the approach proposed.
First, it turns every unlink operation into two write operations
(rename + unlink), not to mention new risks of having stale .tmp files
(perhaps it's OK for regular files (MoveFileEx can overwrite existing
files), but not for directories).
Second, it does that on any Windows OS versions, including modern ones,
which are not affected by the issue, as we know.

So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.
And I was very surprised to see that this state is not transient in our case.
Additional investigation showed that the test fails not because some external
process opens a file (concretely, {template1_id/postgres_id}/2683), that is
being deleted, but because of an internal process that opens the file and
holds a handle to it indefinitely.
And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

The other way to go is to find out whether the background writer process
should react to a shared-inval message sent from smgrdounlinkall() and
close that file's handle.

Maybe we could also (after changing the bgwriter's behaviour) add a waiting
loop into pgwin32_open_handle() to completely rule out transient open()
failures due to some other process (such as Windows Explorer) opening a file
being deleted, but I would not complicate things until we have a clear
vision/plan for using modern APIs/relying on modern OS versions' behavior.
I mean proceeding with something like:
https://commitfest.postgresql.org/40/3951/

Best regards,
Alexander

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#15)
Re: Random pg_upgrade test failure on drongo

On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:

Dear Alexander,

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

There are no other reviewers, so I will review it again. Let's turn the PoC
into a concrete patch. Note that I focused only on fixing the random failure;
other parts are out of scope.

Thinking about that fix more, I'm not satisfied with the approach proposed.
First, it turns every unlink operation into two write operations
(rename + unlink), not to mention new risks of having stale .tmp files
(perhaps it's OK for regular files (MoveFileEx can overwrite existing
files), but not for directories).
Second, it does that on any Windows OS versions, including modern ones,
which are not affected by the issue, as we know.

So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."

And I was very surprised to see that this state is not transient in our case.
Additional investigation showed that the test fails not because some external
process opens a file (concretely, {template1_id/postgres_id}/2683), that is
being deleted, but because of an internal process that opens the file and
holds a handle to it indefinitely.
And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

What about checkpoints? Can't it do the same while writing the buffers?

--
With Regards,
Amit Kapila.

#17Alexander Lakhin
exclusion@gmail.com
In reply to: Amit Kapila (#16)
Re: Random pg_upgrade test failure on drongo

Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:

So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."

Yes, I saw it, but initially I thought that we had a transient condition
there, so waiting in open() (instead of failing immediately) seemed like a
good idea then...

And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it may not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

Please see the reproducing script attached.

Best regards,
Alexander

Attachments:

pg_upgrade_error-repro.txt (text/plain)
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#17)
Re: Random pg_upgrade test failure on drongo

On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

03.01.2024 14:42, Amit Kapila wrote:

And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it may not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1]: https://commitfest.postgresql.org/40/3951/
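In configuration terms, that proposal would amount to something like the
following additions to the new tests' postgresql.conf (a sketch; exact
values are subject to review):

```
bgwriter_lru_maxpages = 0   # stop bgwriter LRU writes during the upgrade
checkpoint_timeout = 1h     # keep timed checkpoints out of the test window
```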

--
With Regards,
Amit Kapila.

#19Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#18)
Re: Random pg_upgrade test failure on drongo

On 1/4/24 10:19 PM, Amit Kapila wrote:

On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

03.01.2024 14:42, Amit Kapila wrote:

And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it may not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and
ReFS) under the carpet? I realize that synthetic test workloads like
pg_upgrade in a loop aren't themselves real-world scenarios, but what
about other cases? Even if we're certain it's not possible for these
issues to wedge a server, it's still not a good experience for users to
get random, unexplained IO-related errors...
--
Jim Nasby, Data Architect, Austin TX

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#19)
Re: Random pg_upgrade test failure on drongo

On Mon, Jan 8, 2024 at 9:36 PM Jim Nasby <jim.nasby@gmail.com> wrote:

On 1/4/24 10:19 PM, Amit Kapila wrote:

On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:

03.01.2024 14:42, Amit Kapila wrote:

And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it may not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and
ReFS) under the carpet? I realize that synthetic test workloads like
pg_upgrade in a loop aren't themselves real-world scenarios, but what
about other cases? Even if we're certain it's not possible for these
issues to wedge a server, it's still not a good experience for users to
get random, unexplained IO-related errors...

The point is that this is an existing, known Windows behavior, and only in
certain versions at that. The fix doesn't seem to be straightforward, so it
seems advisable to avoid random BF failures by using an appropriate
configuration.

--
With Regards,
Amit Kapila.

#21Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#18)
#22Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#21)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#22)
#24Alexander Lakhin
exclusion@gmail.com
In reply to: Amit Kapila (#23)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#24)
#26Alexander Lakhin
exclusion@gmail.com
In reply to: Amit Kapila (#25)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Lakhin (#26)
#28Alexander Lakhin
exclusion@gmail.com
In reply to: Amit Kapila (#27)
#29Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#28)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#29)
#31Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#30)