PostgreSQL occasionally unable to rename WAL files (NTFS)

Started by Guy Burgessabout 5 years ago10 messagesgeneral
Jump to latest
#1Guy Burgess
guy@burgess.co.nz

Hello,

Running 13.1 on Windows Server 2019, I am getting the following log
entries occasionally:

    2021-02-11 12:34:10.149 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:40:31.377 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:46:06.294 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:46:16.502 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000DA": Permission denied
    2021-02-11 12:50:20.917 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:50:31.098 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000DA": Permission denied

What appears to be happening is the affected WAL files (which is usually
only 2 or 3 WAL files at a time) are somehow "losing" their NTFS
permissions, so the PG process can't rename them - though of course the
PG process created them. Even running icacls as admin gives "Access is
denied" on those files. A further oddity is the affected files do end up
disappearing after a while.

The NTFS permissions on the pg_wal directory are correct, and most WAL
files are unaffected. Chkdsk reports no problems, and the database is
working fine otherwise. Have tried disabling antivirus software in case
that was doing something but no difference.

I found another recent report of similar behaviour here:
https://stackoverflow.com/questions/65405479/postgresql-13-log-could-not-rename-file-pg-wal-0000000100000001000000c6

WAL config as follows:

wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = fsync
full_page_writes = on
wal_compression = off
wal_log_hints = off
wal_init_zero = on
wal_recycle = on
wal_buffers = -1
wal_writer_delay = 200ms
wal_writer_flush_after = 1MB
wal_skip_threshold = 2MB
commit_delay = 0
commit_siblings = 5
checkpoint_timeout = 5min
max_wal_size = 2GB
min_wal_size = 256MB
checkpoint_completion_target = 0.7
checkpoint_flush_after = 0
checkpoint_warning = 30s
archive_mode = off

I'm thinking of disabling wal_recycle as a first step to see if that
makes any difference, but thought I'd seek some comments first.

Not sure how much of a problem this is - the database is running fine
otherwise - but any thoughts would be appreciated.

Thanks & regards,

Guy

#2Lionel Bouton
lionel.bouton@jtek.fr
In reply to: Guy Burgess (#1)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

Hi,

Le 11/02/2021 à 01:21, Guy Burgess a écrit :

Hello,

Running 13.1 on Windows Server 2019, I am getting the following log
entries occasionally:

    2021-02-11 12:34:10.149 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:40:31.377 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:46:06.294 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:46:16.502 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000DA": Permission denied
    2021-02-11 12:50:20.917 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000D3": Permission denied
    2021-02-11 12:50:31.098 NZDT [6072] LOG:  could not rename file
"pg_wal/0000000100000099000000DA": Permission denied

What appears to be happening is the affected WAL files (which is
usually only 2 or 3 WAL files at a time) are somehow "losing" their
NTFS permissions, so the PG process can't rename them - though of
course the PG process created them. Even running icacls as admin gives
"Access is denied" on those files. A further oddity is the affected
files do end up disappearing after a while.

The NTFS permissions on the pg_wal directory are correct, and most WAL
files are unaffected. Chkdsk reports no problems, and the database is
working fine otherwise. Have tried disabling antivirus software in
case that was doing something but no difference.

I haven't dealt with a Windows environment for quite some time, but from
what I remember an antivirus installs a driver intercepting file
accesses and these drivers are still active even if you disable the
antivirus (I suppose they just call a noop instead of content analysis
code) and can still interfere with your system. For example some years
ago I've seen what looked like a race condition involving rename for
MySQL on Windows that could not be fixed by disabling the antivirus but
could by uninstalling it completely.

You might want to uninstall the antivirus temporarily to check this.

Best regards,

--
Lionel Bouton
gérant de JTEK SARL
https://www.linkedin.com/in/lionelbouton/

#3Thorsten Schöning
tschoening@am-soft.de
In reply to: Guy Burgess (#1)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

Guten Tag Guy Burgess,
am Donnerstag, 11. Februar 2021 um 01:21 schrieben Sie:

What appears to be happening is the affected WAL files (which is
usually only 2 or 3 WAL files at a time) are somehow "losing" their
NTFS permissions, so the PG process can't rename them - though of
course the PG process created them. Even running icacls as admin
gives "Access is denied" on those files. A further oddity is the
affected files do end up disappearing after a while.

If you see that somewhat frequently, use Process Monitor and Process
Explorer to see who accesses those files how. ProcExp easily allows
you to find all open handles per file. If it's not AV, it might be
something like Windows Search Indexer as well, if that is enabled by
default in Server 2019 at all.

Though, even with my Windows 10 and Search Indexer enabled I didn't
run into such problems yet. And especially when rewriting large parts
of my databases with lots of created WAL files, I see the Indexer
working on those files, but NOT conflicting with Postgres yet.

The behaviour you describe happens exactly when two processes e.g.
concurrently hold HANDLEs on the same file and one of those deletes
the file then. Windows keeps file names until all open HANDLEs are
closed and depending on how those HANDLEs have been opened by the
first app, concurrent deletion is perferctly fine for Windows.

Though, a such deleted file can't be opened easily anymore and looks
like it has lost permissions only. But that's not the case, it's
deleted already. It might be that this happens for Postgres to itself
somehow when some other app has an open HANDLE. I don't think that
some other app is deleting that file by purpose instead, reading it
for some reason seems more likely to me.

dwShareMode
FILE_SHARE_DELETE

Enables subsequent open operations on a file or device to request
delete access. Otherwise, other processes cannot open the file or
device if they request delete access.

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew

Mit freundlichen Grüßen

Thorsten Schöning

--
Thorsten Schöning
AM-SoFT IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK

E-Mail: Thorsten.Schoening@AM-SoFT.de
Web: http://www.AM-SoFT.de/

Telefon: 05151- 9468- 0
Telefon: 05151- 9468-55
Fax: 05151- 9468-88
Mobil: 0178-8 9468-04

AM-SoFT IT-Service - Bitstore Hameln GmbH i.G., Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB neu - Geschäftsführer: Janine Galonska

Für Rückfragen stehe ich Ihnen sehr gerne zur Verfügung.

Mit freundlichen Grüßen

Thorsten Schöning

Tel: 05151 9468 0
Fax: 05151 9468 88
Mobil:
Webseite: https://www.am-soft.de

AM-Soft IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK

AM-Soft IT-Service - Bitstore Hameln GmbH i.G.
Brandenburger Str. 7c
31789 Hameln
Tel: 05151 9468 0

Bitstore IT-Consulting GmbH
Zentrale - Berlin Lichtenberg
Frankfurter Allee 285
10317 Berlin
Tel: 030 453 087 80

CBS IT-Service - Bitstore Kaulsdorf UG
Tel: 030 453 087 880 1

Büro Dallgow-Döberitz
Tel: 03322 507 020

Büro Kloster Lehnin
Tel: 033207 566 530

PCE IT-Service - Bitstore Darmstadt UG
Darmstadt
Tel: 06151 392 973 0

Büro Neuruppin
Tel: 033932 606 090

ACI EDV Systemhaus Dresden GmbH
Dresden
Tel: 0351 254 410

Das Systemhaus - Bitstore Magdeburg GmbH
Magdeburg
Tel: 0391 636 651 0

Allerdata.IT - Bitstore Wittenberg GmbH
Wittenberg
Tel: 03491 876 735 7

Büro Liebenwalde
Tel: 033054 810 00

HSA - das Büro - Bitstore Altenburg UG
Altenburg
Tel: 0344 784 390 97

Bitstore IT – Consulting GmbH
NL Piesteritz
Piesteritz
Tel: 03491 644 868 6

Solltec IT-Services - Bitstore Braunschweig UG
Braunschweig
Tel: 0531 206 068 0

MF Computer Service - Bitstore Gütersloh GmbH
Gütersloh
Tel: 05245 920 809 3

Firmensitz: MF Computer Service - Bitstore Gütersloh GmbH, Gildestraße 25, 33442 Herzebrock-Clarholz
Geschäftsführer Janine Galonska

#4Guy Burgess
guy@burgess.co.nz
In reply to: Lionel Bouton (#2)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

On 12/02/2021 12:31 am, Lionel Bouton wrote:

I haven't dealt with a Windows environment for quite some time, but from
what I remember an antivirus installs a driver intercepting file
accesses and these drivers are still active even if you disable the
antivirus (I suppose they just call a noop instead of content analysis
code) and can still interfere with your system. For example some years
ago I've seen what looked like a race condition involving rename for
MySQL on Windows that could not be fixed by disabling the antivirus but
could by uninstalling it completely.

You might want to uninstall the antivirus temporarily to check this.

Thanks Lionel for this suggestion. Its the built-in Windows Defender AV
which I believe can't be uninstalled, so am limited to 'disabling' it
(plus it has exclusions for the PG directories & processes already).
Using Procmon I can't see any AV file activity when it is disabled. Will
keep monitoring it though.

Kind regards

Guy

#5Guy Burgess
guy@burgess.co.nz
In reply to: Thorsten Schöning (#3)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

On 12/02/2021 4:33 am, Thorsten Schöning wrote:

If you see that somewhat frequently, use Process Monitor and Process
Explorer to see who accesses those files how. ProcExp easily allows
you to find all open handles per file. If it's not AV, it might be
something like Windows Search Indexer as well, if that is enabled by
default in Server 2019 at all.

Thanks Thorsten, that's a great idea - I'm using Procmon and Process
Explorer and able to monitor activity on the WAL files, so hopefully
that sheds some light. The Indexer service is disabled, so can rule that
out.

Kind regards

Guy

#6Guy Burgess
guy@burgess.co.nz
In reply to: Thorsten Schöning (#3)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

On 12/02/2021 4:33 am, Thorsten Schöning wrote:

The behaviour you describe happens exactly when two processes e.g.
concurrently hold HANDLEs on the same file and one of those deletes
the file then. Windows keeps file names until all open HANDLEs are
closed and depending on how those HANDLEs have been opened by the
first app, concurrent deletion is perferctly fine for Windows.

Though, a such deleted file can't be opened easily anymore and looks
like it has lost permissions only. But that's not the case, it's
deleted already. It might be that this happens for Postgres to itself
somehow when some other app has an open HANDLE. I don't think that
some other app is deleting that file by purpose instead, reading it
for some reason seems more likely to me.

Using Process Monitor, Thorsten's explanation above appears to correctly
diagnose what is happening. ProcMon data shows postgres.exe performing
"CreateFile" operations on the affected WAL files, with the result
status "DELETE PENDING". Which according to
https://stackoverflow.com/a/29892104 means:

"Windows allows a process to delete a file, even though it is still
opened by another process (e.g. Windows indexing service or
Antivirus). It gets internally marked as "delete pending". The file
does not actually get removed from the file system, it is still
there after the File.Delete call. Anybody that tries to open the
file after that gets an access denied error. The file doesn't
actually get removed until the last handle to the file object gets
closed"

which is the same behaviour Thorsten describes above (great info, thanks
Thorsten).

The mystery now is that the only process logged as touching the affected
WAL files is postgres.exe (of which there are many separate processes).
Could it be that one of the postgres.exe instances is holding the
affected WAL files in use after another postgres.exe instance has
flagged the file as deleted? (or to put it the other way, a postgres.exe
instance is flagging the file as deleted while another instance still
has an open handle to the file)? If it is some other process such as the
indexer (disabled) or AV (excluded from pgdata) is obtaining a handle on
the WAL files, it isn't being logged by ProcMon.

Kind regards,

Guy

#7Thorsten Schöning
tschoening@am-soft.de
In reply to: Guy Burgess (#6)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

Guten Tag Guy Burgess,
am Montag, 15. Februar 2021 um 11:52 schrieben Sie:

The mystery now is that the only process logged as touching the
affected WAL files is postgres.exe (of which there are many separate
processes). Could it be that one of the postgres.exe instances is
holding the affected WAL files in use after another postgres.exe
instance has flagged the file as deleted?[...]

I suggest checking your WAL-related and archive/backup settings for
Postgres again. There's e.g. "archive_command" optionally copying WALs
to some other place and postgres.exe would wait until that process has
finished, maybe locking the file to copy itself as well. Or
"archive_timeout" interfering with some other operations or alike.

Mit freundlichen Grüßen

Thorsten Schöning

--
Thorsten Schöning
AM-SoFT IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK

E-Mail: Thorsten.Schoening@AM-SoFT.de
Web: http://www.AM-SoFT.de/

Telefon: 05151- 9468- 0
Telefon: 05151- 9468-55
Fax: 05151- 9468-88
Mobil: 0178-8 9468-04

AM-SoFT IT-Service - Bitstore Hameln GmbH i.G., Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB neu - Geschäftsführer: Janine Galonska

Für Rückfragen stehe ich Ihnen sehr gerne zur Verfügung.

Mit freundlichen Grüßen

Thorsten Schöning

Tel: 05151 9468 0
Fax: 05151 9468 88
Mobil:
Webseite: https://www.am-soft.de

AM-Soft IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK

AM-Soft IT-Service - Bitstore Hameln GmbH i.G.
Brandenburger Str. 7c
31789 Hameln
Tel: 05151 9468 0

Bitstore IT-Consulting GmbH
Zentrale - Berlin Lichtenberg
Frankfurter Allee 285
10317 Berlin
Tel: 030 453 087 80

CBS IT-Service - Bitstore Kaulsdorf UG
Tel: 030 453 087 880 1

Büro Dallgow-Döberitz
Tel: 03322 507 020

Büro Kloster Lehnin
Tel: 033207 566 530

PCE IT-Service - Bitstore Darmstadt UG
Darmstadt
Tel: 06151 392 973 0

Büro Neuruppin
Tel: 033932 606 090

ACI EDV Systemhaus Dresden GmbH
Dresden
Tel: 0351 254 410

Das Systemhaus - Bitstore Magdeburg GmbH
Magdeburg
Tel: 0391 636 651 0

Allerdata.IT - Bitstore Wittenberg GmbH
Wittenberg
Tel: 03491 876 735 7

Büro Liebenwalde
Tel: 033054 810 00

HSA - das Büro - Bitstore Altenburg UG
Altenburg
Tel: 0344 784 390 97

Bitstore IT – Consulting GmbH
NL Piesteritz
Piesteritz
Tel: 03491 644 868 6

Solltec IT-Services - Bitstore Braunschweig UG
Braunschweig
Tel: 0531 206 068 0

MF Computer Service - Bitstore Gütersloh GmbH
Gütersloh
Tel: 05245 920 809 3

Firmensitz: MF Computer Service - Bitstore Gütersloh GmbH, Gildestraße 25, 33442 Herzebrock-Clarholz
Geschäftsführer Janine Galonska

#8Guy Burgess
guy@burgess.co.nz
In reply to: Thorsten Schöning (#7)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

On 16/02/2021 12:23 am, Thorsten Schöning wrote:

The mystery now is that the only process logged as touching the
affected WAL files is postgres.exe (of which there are many separate
processes). Could it be that one of the postgres.exe instances is
holding the affected WAL files in use after another postgres.exe
instance has flagged the file as deleted?[...]

I suggest checking your WAL-related and archive/backup settings for
Postgres again. There's e.g. "archive_command" optionally copying WALs
to some other place and postgres.exe would wait until that process has
finished, maybe locking the file to copy itself as well. Or
"archive_timeout" interfering with some other operations or alike.

Thanks Thorsten.  The WAL archive settings are out-of-the-box defaults,
i.e. disabled: archive_mode = off; archive_command = ''; archive_timeout
= 0.

I'm not sure there is anything else I can check at this time. The good
thing is it doesn't seem to cause any problem other than logging "could
not rename file" warnings, so I might have to park this for now. If I
find anything else that might offer a new lead I will report back.

Kind regards,

Guy

#9Thomas Kellerer
shammat@gmx.net
In reply to: Guy Burgess (#6)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

Guy Burgess schrieb am 15.02.2021 um 11:52:

The mystery now is that the only process logged as touching the
affected WAL files is postgres.exe (of which there are many separate
processes). Could it be that one of the postgres.exe instances is
holding the affected WAL files in use after another postgres.exe
instance has flagged the file as deleted? (or to put it the other
way, a postgres.exe instance is flagging the file as deleted while
another instance still has an open handle to the file)? If it is some
other process such as the indexer (disabled) or AV (excluded from
pgdata) is obtaining a handle on the WAL files, it isn't being logged
by ProcMon.

My first suspect is always the anti-virus on Windows when things like
that happen with Postgres.

Maybe the AV configuration was changed and pgdata is no longer excluded.

Thomas

#10Michael Paquier
michael@paquier.xyz
In reply to: Thomas Kellerer (#9)
Re: PostgreSQL occasionally unable to rename WAL files (NTFS)

On Tue, Feb 16, 2021 at 12:22:36PM +0100, Thomas Kellerer wrote:

My first suspect is always the anti-virus on Windows when things like
that happen with Postgres.

Or maybe not. 13 has introduced a regression in this area AFAIK, and
909b449 should have taken care of it (available in 13.3~).
--
Michael