Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

Started by Waka Ranaialmost 2 years ago9 messagesbugs
Jump to latest
#1Waka Ranai
wakadotranai@gmail.com

Hello,

I tested the pg_upgrade tool many times on different servers (always
Windows server 19, actual subversion may differ) when trying to upgrade an
existing database from Postgres 9.6 to Postgres 15 (I tried both the 15.4.2
and 15.7) and was almost all the time faced with this issue during the step
“Setting next transaction ID and epoch for new cluster”.

Here’s the version of one of the servers, on which it failed at least three
times :

[image: image.png]

The command I ran is "C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d
"C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program
Files\PostgreSQL\15\data" -b "C:\Program Files\PostgreSQL\9.6\bin" -B
"C:\Program Files\PostgreSQL\15\bin" -U postgres after having set
PGPASSWORD to the correct password.

The issue was either “pg_resetwal: error: could not delete file
"pg_wal/000000010000000000000001": Permission denied” or sometimes it was
saying that the file could not be found instead of Permission denied. When
I look in the directory while it is executing, I can see that the file is
there previously, and always removed after the pg_upgrade crashes. I tried
to inspect with Process Explorer what processes were using it, always
processes from postgres, only one after a fresh install of postgres 15, but
I saw that during the execution of pg_upgrade, sometimes two processes were
using it.

I suspect that there is some sort of race condition where one process sees
that the file exists, does something with it and deletes it, while another
process saw the file existing, but upon trying to delete it, it could not
find it anymore. I had a look in the code and I believe it happens in the
function KillExistingXLOG from line 973 of pg_resetwal.c (
https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L973)
though I cannot be entirely sure of the cause.

You can find the logs produced by the pg_upgrade tool attached, with the
verbose option.

Thanks in advance for the investigation and I hope to understand better the
problem and hopefully see a fix soon as it is complicating the deployment
of a major upgrade of our product,

Have a great day,

Thomas

Attachments:

image.pngimage/png; name=image.pngDownload+5-1
pg_upgrade_output.d.zipapplication/x-zip-compressed; name=pg_upgrade_output.d.zipDownload
#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Waka Ranai (#1)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

On Wed, 2024-05-22 at 18:14 +0200, Waka Ranai wrote:

The command I ran is
"C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data" -b "C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program Files\PostgreSQL\15\bin" -U postgres
after having set PGPASSWORD to the correct password.
 
The issue was either
“pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": Permission denied”
or sometimes it was saying that the file could not be found instead of Permission denied.
When I look in the directory while it is executing, I can see that the file is there previously,
and always removed after the pg_upgrade crashes. I tried to inspect with Process Explorer what
processes were using it, always processes from postgres, only one after a fresh install of
postgres 15, but I saw that during the execution of pg_upgrade, sometimes two processes were
using it.

Did you make sure that no virus scanner is active on the data directories?

Yours,
Laurenz Albe

#3Waka Ranai
wakadotranai@gmail.com
In reply to: Laurenz Albe (#2)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

I cannot confirm for each of the machines, but I know that on some of them
where the issue happened, no virus scanner were installed apart from the
default one of Microsoft, and I did not see any other processes than
postgres using the file when I looked with Process Explorer, do you know
how I can be 100% sure that it was not the case ?

Thanks in advance,

Thomas

Le jeu. 23 mai 2024 à 09:21, Laurenz Albe <laurenz.albe@cybertec.at> a
écrit :

Show quoted text

On Wed, 2024-05-22 at 18:14 +0200, Waka Ranai wrote:

The command I ran is
"C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program

Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data" -b
"C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program
Files\PostgreSQL\15\bin" -U postgres

after having set PGPASSWORD to the correct password.

The issue was either
“pg_resetwal: error: could not delete file

"pg_wal/000000010000000000000001": Permission denied”

or sometimes it was saying that the file could not be found instead of

Permission denied.

When I look in the directory while it is executing, I can see that the

file is there previously,

and always removed after the pg_upgrade crashes. I tried to inspect with

Process Explorer what

processes were using it, always processes from postgres, only one after

a fresh install of

postgres 15, but I saw that during the execution of pg_upgrade,

sometimes two processes were

using it.

Did you make sure that no virus scanner is active on the data directories?

Yours,
Laurenz Albe

#4Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Waka Ranai (#3)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

On Thu, 2024-05-23 at 11:39 +0200, Waka Ranai wrote:

no virus scanner were installed apart from the default one of Microsoft

Perhaps the default virus scanner is at fault. To exclude that possibility,
disable it for the PostgreSQL data directories.

Yours,
Laurenz Albe

#5Waka Ranai
wakadotranai@gmail.com
In reply to: Laurenz Albe (#4)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

We tested on the aforementioned computer after adding an exception on the
pg_wal folder for the Microsoft default antivirus with *Add-MpPreference
-ExclusionPath "C:\Program Files\PostgreSQL\15\data\pg_wal"* but we still
faced the same issue, I included the pg_upgrade logs
[image: image.jpeg]

Le jeu. 23 mai 2024 à 16:31, Laurenz Albe <laurenz.albe@cybertec.at> a
écrit :

Show quoted text

On Thu, 2024-05-23 at 11:39 +0200, Waka Ranai wrote:

no virus scanner were installed apart from the default one of Microsoft

Perhaps the default virus scanner is at fault. To exclude that
possibility,
disable it for the PostgreSQL data directories.

Yours,
Laurenz Albe

Attachments:

image.jpegimage/jpeg; name=image.jpegDownload+1-0
pg_upgrade_output.d.zipapplication/x-zip-compressed; name=pg_upgrade_output.d.zipDownload
#6Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Waka Ranai (#5)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

On Tue, 2024-05-28 at 16:14 +0200, Waka Ranai wrote:

We tested on the aforementioned computer after adding an exception on the pg_wal
folder for the Microsoft default antivirus with
Add-MpPreference -ExclusionPath "C:\Program Files\PostgreSQL\15\data\pg_wal"
but we still faced the same issue, I included the pg_upgrade logs

Thanks. I see

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -u 536 "C:/Program Files/PostgreSQL/15/data" >> "C:/Program Files/PostgreSQL/15/data/pg_upgrade_output.d/202405>
Write-ahead log reset

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214 "C:/Program Files/PostgreSQL/15/data" >> "C:/Program Files/PostgreSQL/15/data/pg_upgrade_output.d/20>
pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": No such file or directory

So it is failing in KillExistingXLOG(): readdir() finds the file,
but by the time unlink() is executed, the file is already gone.
The file in question is the WAL segment written by WriteEmptyXLOG() in the
previous "pg_resetwal" execution.

But the previous "pg_resetwal" has exited by the time the next one is started,
so it should not be at fault.

I found this similar thread:
/messages/by-id/20090910094211.166C5753FB7@cvs.postgresql.org
The symptoms are the same.

I wonder if something like commit 4e2d5efc6a45b1f9f96df42629f6d1c7740e657e
would be useful here too. But it cannot be a PostgreSQL process that is
holding the file open - the creating process has already exited, and no
other PostgreSQL process would read the file.

So the fact remains that there is something *outside of PostgreSQL* that
opens newly created files. You say you disabled the virus scanner, but can
you think of any other software on your system that would do that?
Perhaps you can try disabling the virus scanner completely and check if
that gets rid of the problem.

Yours,
Laurenz Albe

#7Waka Ranai
wakadotranai@gmail.com
In reply to: Laurenz Albe (#6)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

Hello again, I tested after disabling the Microsoft antivirus entirely and
it worked the first time. I then uninstalled completely the new
Postrgres I'm upgrading to (Postgres 15, I made sure to delete the data
folder) and reinstalled it again to try the upgrade a second and a third
time, but both attempts failed, always on the same step, with the same
error message. I also tested on one of the other machines where the upgrade
never succeeded after disabling entirely the antivirus and still got the
error.
I agree that it must be some other process making readdir finding the file,
but releasing before unlink could work, but I could not manage to find
which one (apart from postgres processes) were using the wal file. I was
wondering if it wouldn't be a suitable solution/workaround to not fail when
trying to delete a file that is not there anymore ?
I will continue looking for what process could be reading the newly
modified/created file, but I'm a bit out of luck for now

Le mer. 29 mai 2024 à 09:51, Laurenz Albe <laurenz.albe@cybertec.at> a
écrit :

Show quoted text

On Tue, 2024-05-28 at 16:14 +0200, Waka Ranai wrote:

We tested on the aforementioned computer after adding an exception on

the pg_wal

folder for the Microsoft default antivirus with
Add-MpPreference -ExclusionPath "C:\Program

Files\PostgreSQL\15\data\pg_wal"

but we still faced the same issue, I included the pg_upgrade logs

Thanks. I see

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -u 536
"C:/Program Files/PostgreSQL/15/data" >> "C:/Program
Files/PostgreSQL/15/data/pg_upgrade_output.d/202405>
Write-ahead log reset

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214
"C:/Program Files/PostgreSQL/15/data" >> "C:/Program
Files/PostgreSQL/15/data/pg_upgrade_output.d/20>
pg_resetwal: error: could not delete file
"pg_wal/000000010000000000000001": No such file or directory

So it is failing in KillExistingXLOG(): readdir() finds the file,
but by the time unlink() is executed, the file is already gone.
The file in question is the WAL segment written by WriteEmptyXLOG() in the
previous "pg_resetwal" execution.

But the previous "pg_resetwal" has exited by the time the next one is
started,
so it should not be at fault.

I found this similar thread:
/messages/by-id/20090910094211.166C5753FB7@cvs.postgresql.org
The symptoms are the same.

I wonder if something like commit 4e2d5efc6a45b1f9f96df42629f6d1c7740e657e
would be useful here too. But it cannot be a PostgreSQL process that is
holding the file open - the creating process has already exited, and no
other PostgreSQL process would read the file.

So the fact remains that there is something *outside of PostgreSQL* that
opens newly created files. You say you disabled the virus scanner, but can
you think of any other software on your system that would do that?
Perhaps you can try disabling the virus scanner completely and check if
that gets rid of the problem.

Yours,
Laurenz Albe

#8Waka Ranai
wakadotranai@gmail.com
In reply to: Waka Ranai (#7)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

Hi again, I eventually found out that Cortex XDR was also installed on the
system, but even after uninstalling it, I'm still faced with the same
issue. I try to monitor the resources that might have a handle on the file,
but the only ones shown are from postgres (one from postgres.exe and one
from pg_resetwal). I did the monitoring with the bundled Resource Monitor
of Microsoft, do you have any recommendations for another monitoring tool
with automatic scanning maybe ?
How could I make sure that the issue is not due to an internal postgres
process ?
Did you consider not failing the upgrade if the file cannot be deleted ?
What would be the problems, if any, in that use case ?
Thanks in advance

Le ven. 31 mai 2024 à 14:01, Waka Ranai <wakadotranai@gmail.com> a écrit :

Show quoted text

Hello again, I tested after disabling the Microsoft antivirus entirely and
it worked the first time. I then uninstalled completely the new
Postrgres I'm upgrading to (Postgres 15, I made sure to delete the data
folder) and reinstalled it again to try the upgrade a second and a third
time, but both attempts failed, always on the same step, with the same
error message. I also tested on one of the other machines where the upgrade
never succeeded after disabling entirely the antivirus and still got the
error.
I agree that it must be some other process making readdir finding the
file, but releasing before unlink could work, but I could not manage to
find which one (apart from postgres processes) were using the wal file. I
was wondering if it wouldn't be a suitable solution/workaround to not fail
when trying to delete a file that is not there anymore ?
I will continue looking for what process could be reading the newly
modified/created file, but I'm a bit out of luck for now

Le mer. 29 mai 2024 à 09:51, Laurenz Albe <laurenz.albe@cybertec.at> a
écrit :

On Tue, 2024-05-28 at 16:14 +0200, Waka Ranai wrote:

We tested on the aforementioned computer after adding an exception on

the pg_wal

folder for the Microsoft default antivirus with
Add-MpPreference -ExclusionPath "C:\Program

Files\PostgreSQL\15\data\pg_wal"

but we still faced the same issue, I included the pg_upgrade logs

Thanks. I see

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -u 536
"C:/Program Files/PostgreSQL/15/data" >> "C:/Program
Files/PostgreSQL/15/data/pg_upgrade_output.d/202405>
Write-ahead log reset

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214
"C:/Program Files/PostgreSQL/15/data" >> "C:/Program
Files/PostgreSQL/15/data/pg_upgrade_output.d/20>
pg_resetwal: error: could not delete file
"pg_wal/000000010000000000000001": No such file or directory

So it is failing in KillExistingXLOG(): readdir() finds the file,
but by the time unlink() is executed, the file is already gone.
The file in question is the WAL segment written by WriteEmptyXLOG() in the
previous "pg_resetwal" execution.

But the previous "pg_resetwal" has exited by the time the next one is
started,
so it should not be at fault.

I found this similar thread:
/messages/by-id/20090910094211.166C5753FB7@cvs.postgresql.org
The symptoms are the same.

I wonder if something like commit 4e2d5efc6a45b1f9f96df42629f6d1c7740e657e
would be useful here too. But it cannot be a PostgreSQL process that is
holding the file open - the creating process has already exited, and no
other PostgreSQL process would read the file.

So the fact remains that there is something *outside of PostgreSQL* that
opens newly created files. You say you disabled the virus scanner, but
can
you think of any other software on your system that would do that?
Perhaps you can try disabling the virus scanner completely and check if
that gets rid of the problem.

Yours,
Laurenz Albe

#9Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Waka Ranai (#8)
Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file

On Tue, 2024-07-02 at 11:57 +0200, Waka Ranai wrote:

command: "C:/Program Files/PostgreSQL/15/bin/pg_resetwal" -f -x 3466214 "C:/Program Files/PostgreSQL/15/data" >> "C:/Program Files/PostgreSQL/15/data/pg_upgrade_output.d/20>
pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": No such file or directory

Hi again, I eventually found out that Cortex XDR was also installed on the system,
but even after uninstalling it, I'm still faced with the same issue. I try to monitor
the resources that might have a handle on the file, but the only ones shown are from
postgres (one from postgres.exe and one from pg_resetwal). I did the monitoring with
the bundled Resource Monitor of Microsoft, do you have any recommendations for another
monitoring tool with automatic scanning maybe ?

I am not a Windows person; sorry.

How could I make sure that the issue is not due to an internal postgres process ?

No idea - but the problem only happens for you, and as you say, it happens
always. Why would PostgreSQL do something else on your system only?

Did you consider not failing the upgrade if the file cannot be deleted ? What would
be the problems, if any, in that use case ?

Should PostgreSQL be resilient to important files being removed from under it?
I think not...

Yours,
Laurenz Albe