BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Started by PG Bug reporting formover 6 years ago4 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16032
Logged by: Rob Emery
Email address: postgresql@mintsoft.net
PostgreSQL version: 11.5
Operating system: Windows
Description:

Hello,

When running a pg_basebackup in tar mode on windows against a PG 9.5 box we
are finding that the backup doesn't get deleted successfully when it
fails.
Based on
https://github.com/postgres/postgres/commit/9083353b15c3cf8e7bbac104a81ad42281178cdf#diff-473d306e41ee616e92fb58ac128a9dcc
we would expect it to?

To reproduce start running pg_basebackup like so:

PS C:\Users\developer> & "$PgBinPath\pg_basebackup.exe" --host="PGSERVER"
--pgdata="D:\Backups\rofl" --format=tar --wal-method=fetch
--username="backup_user"

from a psql on the box run:
`SELECT pg_terminate_backend(pid) FROM pg_stat_replication WHERE state =
'backup';`
this kills the backup.

If you run pg_basebackup on a linux box it nicely deletes the base.tar and
the directory, whereas on windows we get the following error:
```
pg_basebackup: could not read COPY data: SSL SYSCALL error: EOF detected
pg_basebackup: removing data directory "D:\Backups\rofl"
could not remove file or directory "D:\Backups\rofl/base.tar": Permission
denied
could not remove file or directory "D:\Backups\rofl": Directory not empty
pg_basebackup: failed to remove data directory
```
Once pg_basebackup has exited, the file lock seems to be gone and we can
delete it ourselves,
it looks like pg_basebackup is contending with itself but only on Windows.

We've reproduced this on 2 different Windows machines so we don't think it's
something like AntiVirus getting in the way or similiar.

Thanks,
Rob

#2Rob
postgresql@mintsoft.net
In reply to: PG Bug reporting form (#1)
[PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Hello,

I've already posted this to pg_hackers, but I'm not sure if it's more
normal to
post proposed fixes against the bug itself; so see attached if needed.

Thanks,
Rob

Show quoted text

On 2019-10-01 17:06, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 16032
Logged by: Rob Emery
Email address: postgresql@mintsoft.net
PostgreSQL version: 11.5
Operating system: Windows
Description:

Hello,

When running a pg_basebackup in tar mode on windows against a PG 9.5
box we
are finding that the backup doesn't get deleted successfully when it
fails.
Based on
https://github.com/postgres/postgres/commit/9083353b15c3cf8e7bbac104a81ad42281178cdf#diff-473d306e41ee616e92fb58ac128a9dcc
we would expect it to?

To reproduce start running pg_basebackup like so:

PS C:\Users\developer> & "$PgBinPath\pg_basebackup.exe"
--host="PGSERVER"
--pgdata="D:\Backups\rofl" --format=tar --wal-method=fetch
--username="backup_user"

from a psql on the box run:
`SELECT pg_terminate_backend(pid) FROM pg_stat_replication WHERE state
=
'backup';`
this kills the backup.

If you run pg_basebackup on a linux box it nicely deletes the base.tar
and
the directory, whereas on windows we get the following error:
```
pg_basebackup: could not read COPY data: SSL SYSCALL error: EOF
detected
pg_basebackup: removing data directory "D:\Backups\rofl"
could not remove file or directory "D:\Backups\rofl/base.tar":
Permission
denied
could not remove file or directory "D:\Backups\rofl": Directory not
empty
pg_basebackup: failed to remove data directory
```
Once pg_basebackup has exited, the file lock seems to be gone and we
can
delete it ourselves,
it looks like pg_basebackup is contending with itself but only on
Windows.

We've reproduced this on 2 different Windows machines so we don't think
it's
something like AntiVirus getting in the way or similiar.

Thanks,
Rob

Attachments:

0001-Fix-bug16032-under-windows-when-the-backup-is-aborte.patchtext/x-diff; name=0001-Fix-bug16032-under-windows-when-the-backup-is-aborte.patchDownload+22-1
#3Michael Paquier
michael@paquier.xyz
In reply to: Rob (#2)
Re: [PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

On Sun, Oct 06, 2019 at 08:15:10PM +0100, Rob Emery wrote:

Hello,

I've already posted this to pg_hackers, but I'm not sure if it's
more normal to post proposed fixes against the bug itself; so see
attached if needed.

Discussing patches on -bugs threads is perfectly fine :)

Regarding your patch, anything living in the middle of processing
(basically calling disconnect_and_exit() in ~11 and exit() for 12~)
would fail into this category. Wouldn't it make sense here to use
atexit() to ensure that the cleanup always happens? I am not sure
that it is a good idea to hope that anything processing the base
backup COPY chunks will remember to clean up those handles in the
event of an error. I am ready to bet that any future code will forget
to add that so we would keep falling into the same trap.
--
Michael

#4Rob
postgresql@mintsoft.net
In reply to: Michael Paquier (#3)
Re: [PATCH] Re: BUG #16032: pg_basebackup when running on Windows doesn't clean up on failure correctly

Hiya Michael,

Regarding your patch, anything living in the middle of processing
(basically calling disconnect_and_exit() in ~11 and exit() for 12~)
would fail into this category. Wouldn't it make sense here to use
atexit() to ensure that the cleanup always happens? I am not sure
that it is a good idea to hope that anything processing the base
backup COPY chunks will remember to clean up those handles in the
event of an error. I am ready to bet that any future code will forget
to add that so we would keep falling into the same trap.

Yes and no, the problem is that under Windows the file cleanup
occurs before the process itself actually ends, as the streaming of
the backup from the COPY is performed in a thread, not a fork.

So, if we attempt to use atexit to cleanup the filehandles,
pg_basebackup will still be attempting to delete the partial backup
before the file handles are cleaned up by the atexit callback.

I hope that makes sense; I agree with your sentiment that there must
be a better way of structuring this.

-Rob