could not link file in wal restore lines
I'm doing test restores of WAL with pgbackrest. My problem is that lines
like
could not link file "pg_wal/000000010000015600000098" to
"pg_wal/00000001000001570000006E": File exists
appear in the postgres log on first startup, while the WAL files are
being restored.
The restore and startup process completes fine, with no corruption AFAIK,
but I'd like to make sure that there is indeed no chance of corruption.
I've opened an issue on pgbackrest, where the developer confirmed that
those files are not touched by pgbackrest, they are basically copied back
exactly as they were. The backup/restore is on the same partition as the
data folder, so no chance of file system changes, etc. Single master, no
replication.
I'm always doing a clean data dir + initdb, so there is nothing there
before the restore. I can replicate this behaviour 100%.
PostgreSQL version: psql (PostgreSQL) 12.11 (Ubuntu 12.11-1.pgdg18.04+1)
OS is Ubuntu 18.04
pgbackrest issue: https://github.com/pgbackrest/pgbackrest/issues/1815
log details:
could not link file "pg_wal/000000010000015600000098" to
"pg_wal/00000001000001570000006E": File exists
In total, I get 18 "could not link file" messages and 932 "restored log
file" messages.
At the end it finishes with:
redo done at 15A/A001710
last completed transaction was at log time 2022-07-15 19:41:05.175573+00
restored log file "000000030000015A0000000A" from archive
selected new timeline ID: 4
archive recovery complete
restored log file "00000003.history" from archive
database system is ready to accept connections
---
Thanks and regards,
Zsolt
On Fri, Jul 22, 2022 at 04:24:54AM -0500, Zsolt Ero wrote:
could not link file "pg_wal/000000010000015600000098" to
"pg_wal/00000001000001570000006E": File exists
FWIW, the backend code has protections to prevent *exactly* this kind
of problem when recycling WAL segment files at checkpoints, with a set
of LWLocks taken on the control file, for one. Perhaps something got
mixed up and you ended up in a state where pgbackrest writes to pg_wal/
concurrently with a running cluster that is performing a checkpoint,
which would explain why those link() calls fail?
--
Michael
My script does the following:
0. no processes running
1. rm -rf data_dir
2. mkdir -p data_dir
3. pgbackrest restore
4. start pg service
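In plain shell, that is essentially the following (a sketch; the paths,
stanza name, and postgres invocation are illustrative):
rm -rf /data/pg12                                    # 1-2. clean, empty data dir
mkdir -p /data/pg12
pgbackrest --stanza=app restore                      # 3. restore from the repository
/usr/lib/postgresql/12/bin/postgres -D /data/pg12    # 4. start postgres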
I don't know where this could come from or how I should start debugging it.
Also, if this is not critical then I can just stop worrying about it, but I
don't like having possible bugs in a DB restore process.
Zsolt
On 7/22/22 23:36, Michael Paquier wrote:
Perhaps something got mixed up and you ended up in a state where
pgbackrest writes to pg_wal/ concurrently with a running cluster that is
performing a checkpoint, which would explain why those link() calls
fail?
During recovery pgbackrest only writes into pg_wal when invoked by
restore_command. That generally means pgbackrest should only be writing
to RECOVERYXLOG.
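(As a sketch of what that looks like: during archive recovery PostgreSQL
substitutes %f with the requested segment name and, for WAL segments, %p
with a path pointing at pg_wal/RECOVERYXLOG, so the restore_command
invocation is effectively something like
pgbackrest --stanza=app archive-get 000000010000015600000098 "pg_wal/RECOVERYXLOG"
with the stanza name taken from this thread.)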
Zsolt, can you add --log-level-console=info to your archive_command so
we can see what parameters are being passed to pgbackrest? This logging
will appear in the postgres log.
Regards,
-David
I've posted the full command which replicates this issue on the GitHub
issue:
c.run(f'rm -rf {p.remote_data}/pg12')
c.run(f'mkdir -p {p.remote_data}/pg12')
c.run(
    f'docker run --rm '
    f'--name {p.instance_name} '
    f'-v {p.remote_data}:/data '
    f'-v {p.remote_shared}:/shared '
    f'{p.repo_name}:{version} '
    'pgbackrest --stanza=app --log-level-console=info '
    '--log-level-file=off restore',
)
c.run(
    f'docker run --rm '
    f'--name {p.instance_name} '
    f'-v {p.remote_data}:/data '
    f'-v {p.remote_shared}:/shared '
    '--user=postgres '
    f'{p.repo_name}:{version} '
    '/usr/lib/postgresql/12/bin/postgres -D /data/pg12',
)
c.run() is just SSH automation (via Fabric), so these commands are the same
as I'd run in the terminal.
Zsolt, can you add --log-level-console=info to your archive_command so
we can see what parameters are being passed to pgbackrest?
I changed it to
archive_command = 'pgbackrest --stanza=app --log-level-console=info archive-push %p'
on the destination server (where the restore happens). There are no changes
in the log, only 3 lines appear once everything is done. I've attached the
log. Do I need to change this command on the source server?
Zsolt
On 7/23/22 12:39, Zsolt Ero wrote:
There are no changes in the log, only 3 lines appear once everything is
done. I've attached the log. Do I need to change this command on the
source server?
Oops, sorry, I meant the restore_command on the server where you do the
restore.
You'll need to edit this in postgresql.auto.conf after the pgbackrest
restore but before starting PostgreSQL.
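For example, a sketch using pgbackrest's documented archive-get form and
the stanza name from this thread:
restore_command = 'pgbackrest --stanza=app --log-level-console=info archive-get %f "%p"'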
Regards,
-David
I did that, attached the new log.
At Sat, 23 Jul 2022 12:36:47 +0900, Michael Paquier <michael@paquier.xyz> wrote in
FWIW, the backend code has protections to prevent *exactly* this kind
of problem when recycling WAL segment files at checkpoints, with a set
of LWLocks taken on the control file, for one.
That lock doesn't seem to exclude recovery.
I can reproduce this with the script below when some sleep is added
before (or after) the durable_link_or_rename call in
InstallXLogFileSegment (patch attached). Some adjustment might be
required to reproduce the same on other environments.
=====
2022-07-25 17:05:57.730 JST [151758] LOG: restored log file "000000010000000000000057" from archive
2022-07-25 17:05:57.760 JST [151758] LOG: restored log file "000000010000000000000058" from archive
2022-07-25 17:05:57.782 JST [151758] LOG: restored log file "000000010000000000000059" from archive
2022-07-25 17:05:57.790 JST [151762] LOG: could not link file "pg_wal/000000010000000000000002" to "pg_wal/000000010000000000000059": File exists
2022-07-25 17:05:57.802 JST [151758] LOG: restored log file "00000001000000000000005A" from archive
2022-07-25 17:05:58.294 JST [151762] LOG: could not link file "pg_wal/000000010000000000000003" to "pg_wal/00000001000000000000005A": File exists
========
#! /bin/bash
# create a backup-source
PGDATA=~/test/data
PGARC=~/test/arc
BKDIR=~/test/bk
CPDATA=~/test/dt
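# /tmp/hoge presumably gates the sleep injected by the attached patch:
# it is removed while the source runs and recreated just before recovery starts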
rm /tmp/hoge
rm -r $PGDATA $PGARC $BKDIR $CPDATA
mkdir $PGARC
killall -9 postgres
initdb -D $PGDATA
echo "archive_mode=on" >> $PGDATA/postgresql.conf
echo "archive_command = 'cp %p $PGARC/%f'" >> $PGDATA/postgresql.conf
#start the source
pg_ctl -D $PGDATA start
# take a backup
pg_basebackup -D $BKDIR
echo "archive_mode=off" >> $BKDIR/postgresql.conf
echo "restore_command='cp $PGARC/%f %p'" >> $BKDIR/postgresql.conf
touch $BKDIR/recovery.signal
# create archived segments
psql -c 'create table t (a int)'
for i in $(seq 1 100); do psql -c 'insert into t values(0); select pg_switch_wal()'; done
#stop the source
pg_ctl -D $PGDATA stop
# start recovery
rm -rf $CPDATA
cp -r $BKDIR $CPDATA
touch /tmp/hoge
postgres -D $CPDATA 2>&1 | tee recovery.log
======
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments: repro20220725.diff (+5 -0)
At Mon, 25 Jul 2022 17:11:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
2022-07-25 17:05:57.790 JST [151762] LOG: could not link file "pg_wal/000000010000000000000002" to "pg_wal/000000010000000000000059": File exists
(Sorry for the broken mail..)
So, this is a race condition between InstallXLogFileSegment and
KeepFileRestoredFromArchive.
We could take ControlFileLock in the latter to arbitrate the race, but I
think it would be better for the checkpointer to give up the rename,
assuming the file has been added by recovery, after adding some
prerequisite checks or confirming the recovered files.
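For illustration, the failing link() semantics are easy to reproduce
from a shell (throwaway file names):
touch seg_old seg_new   # both paths already exist
ln seg_old seg_new      # link(2) fails with "File exists"
This is what happens when the checkpointer tries to install a recycled
segment under a name that recovery has just restored from the archive.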
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 7/23/22 12:53, Zsolt Ero wrote:
I did that, attached the new log.
From this log it looks like pgBackRest is writing to RECOVERYXLOG as I
expected.
The error happens relatively rarely (18 times) compared to the number of
WAL segments fetched (1029). pgBackRest does not write the files in
question, e.g.
LOG: could not link file "pg_wal/000000010000015B0000002C" to
"pg_wal/000000010000015B000000F4": File exists
And the pg_wal directory started empty.
I ran one of our standard stress tests on 12.11 (recovering 1592 WAL
segments) and I did not see the "could not link file" message.
Your system seems to be doing recovery pretty quickly. I wonder if there
is a race condition in WAL recycling?
One thing you could try is to set wal_recycle = false.
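For example (a sketch; the data directory path is illustrative):
echo "wal_recycle = false" >> /data/pg12/postgresql.conf   # before starting the restored cluster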
Regards,
-David
Ah, sorry for the too-early messages I posted in the thread.
At Mon, 25 Jul 2022 08:40:12 -0400, David Steele <david@pgmasters.net> wrote in
Your system seems to be doing recovery pretty quickly. I wonder if
there is a race condition in WAL recycling?
Please take a look at [1].
[1]: /messages/by-id/20220725.171132.2272594383346737093.horikyota.ntt@gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 25 Jul 2022 08:40:12 -0400, David Steele <david@pgmasters.net> wrote in
Your system seems to be doing recovery pretty quickly. I wonder if
there is a race condition in WAL recycling?
And it has been fixed by cc2c7d65fc in PG15. That discussion [1]
concluded that we don't back-patch it:
Does anyone prefer some alternative? It's probably not worth
back-patching anything for a restartpoint failure this rare, because
most restartpoint outcomes are not user-visible.
regards.
[1]: /messages/by-id/20210202151416.GB3304930@rfd.leadboat.com
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 7/26/22 02:03, Kyotaro Horiguchi wrote:
And it has been fixed by cc2c7d65fc in PG15. That discussion [1]
concluded that we don't back-patch it:
Does anyone prefer some alternative? It's probably not worth
back-patching anything for a restartpoint failure this rare, because
most restartpoint outcomes are not user-visible.
I have responded on that thread to see if Noah can have a look and
decide if it would be worth back-patching cc2c7d65fc.
There have been other changes in this area (e.g. removing
durable_rename_excl) so it would be good to know if back-patching just
cc2c7d65fc will work.
Kyotaro, since you can reproduce the issue would you be willing to test
that?
Regards,
-David
David, my server is a dedicated AMD Ryzen 7 3700X with 2 x 1 TB NVMe
SSDs in RAID 0, so it's indeed quite fast, especially in I/O, compared
to most cloud hosting options with non-local storage.
It's this one: https://www.hetzner.com/dedicated-rootserver/ax51-nvme
I can imagine that the same stress test doesn't trigger the race
condition on most cloud servers.
I'm happy to run tests as long as they are in Docker containers, I prefer
not to install anything directly on the host.
Zsolt
At Tue, 26 Jul 2022 08:37:40 -0700, Zsolt Ero <zsolt.ero@gmail.com> wrote in
I'm happy to run tests as long as they are in Docker containers, I prefer
not to install anything directly on the host.
It would be better if the failure could be reproduced without an extra
code tweak. Aside from that, I'm willing to try to reproduce the issue
in my environment.
I'm going to modify the patch for PG12 and then follow up.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
I'm going to modify the patch for PG12 and then follow up.
So, this is a version of cc2c7d65fc minimally modified (but not checked
seriously) for PG12. (cc2c7d65fc applied almost cleanly.)
I applied this and observed that WAL recycling is prevented until
archive recovery ends, and I no longer saw the "could not link" error.
Zsolt, could you confirm this patch works for you?
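For anyone testing, applying it might look like this (a sketch; the
source directory name and build options are assumptions):
cd postgresql-12.11
patch -p1 < cc2c7d65fc_for_PG12.patch
./configure && make && make install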
regards
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments: cc2c7d65fc_for_PG12.patch
Can you tell me how critical the state of this patch is? I mean, is there
any chance of data corruption in the non-patched version?
Zsolt
On 8/9/22 17:23, Zsolt Ero wrote:
Can you tell me how critical the state of this patch is? I mean, is there
any chance of data corruption in the non-patched version?
We believe there is no chance of corruption due to these messages.
Because back-patching the better fix that was applied to PG15 could
cause unintended consequences, our decision, at least for now, was to
add a HINT [1] to the log message.
Regards,
-David
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=41f613fc257c4d1df97f5f244461d1acb2b65227
Thanks a lot for the explanation and confirming that it's harmless.