pg_rewind WAL segments deletion pitfall
Hello,
It seems to me that there is currently a pitfall in the pg_rewind
implementation.
Imagine the following situation:
There is a cluster consisting of a primary with the following
configuration: wal_level='replica', archive_mode='on', and a replica.
1. The primary is not fast enough in archiving WAL segments (e.g.
network issues, high CPU/disk load...)
2. The primary fails
3. The replica is promoted
4. We are not lucky enough, the new and the old primary’s timelines
diverged, we need to run pg_rewind
5. We are even less lucky: the old primary still has some WAL segments
with .ready signal files that were generated before the point of divergence
and were not archived. (e.g. 000000020004D20200000095.done,
000000020004D20200000096.ready, 000000020004D20200000097.ready,
000000020004D20200000098.ready)
6. The promoted primary runs for some time and recycles the old WAL
segments.
7. We revive the old primary and try to rewind it
8. When pg_rewind finishes successfully, we see that the WAL segments
with .ready files are removed, because they were already absent on the
promoted replica. We end up in a situation where we completely lose some
WAL segments, even though we had a clear sign that they were not
archived, and,
more importantly, pg_rewind read these segments while collecting
information about the data blocks.
9. The old primary fails to start because of the missing WAL segments
(more strictly, the records between the last common checkpoint and the
point of divergence) with the following log record: "ERROR: requested WAL
segment 000000020004D20200000096 has already been removed"
In this situation, after pg_rewind:
archived:
000000020004D20200000095
000000020004D20200000099.partial
000000030004D20200000099
the following segments are lost:
000000020004D20200000096
000000020004D20200000097
000000020004D20200000098
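Segments still carrying .ready markers can at least be copied aside before pg_rewind runs. A minimal sketch, not something from this thread; the function name and directory layout are assumptions:

```shell
# Sketch: copy every WAL segment that still has a .ready marker (i.e. was
# never archived) out of pg_wal, so a later pg_rewind cannot delete the
# only remaining copy.  save_unarchived and its arguments are hypothetical.
save_unarchived() {
    local pgdata=$1 backup=$2
    mkdir -p "$backup"
    for ready in "$pgdata"/pg_wal/archive_status/*.ready; do
        [ -e "$ready" ] || continue      # glob matched nothing
        local seg
        seg=$(basename "$ready" .ready)
        cp "$pgdata/pg_wal/$seg" "$backup/" && echo "saved $seg"
    done
}
```

After a successful rewind and recovery, the saved copies can be archived or discarded.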
Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms
of creating filemap for WALs? Can it preserve the WAL segments that contain
those potentially lost records (> the last common checkpoint and < the
point of divergence) on the target? (see the patch attached)
If I am missing something however, please correct me or explain why it is
not possible to implement this straightforward solution.
Thank you,
Polina Bungina
Attachments:
v1-0001-pg_rewind-wal-deletion.patch (application/octet-stream)
In the first place, this is not a bug. (At least it doesn't seem to be.)
If you mean to propose behavioral changes, -hackers is the place.
At Tue, 23 Aug 2022 17:46:30 +0200, Полина Бунгина <bungina@gmail.com> wrote in
4. We are not lucky enough, the new and the old primary’s timelines
diverged, we need to run pg_rewind
5. We are even less lucky: the old primary still has some WAL segments
with .ready signal files that were generated before the point of divergence
and were not archived.
That doesn't harm pg_rewind at all.
6. The promoted primary runs for some time and recycles the old WAL
segments.
7. We revive the old primary and try to rewind it
8. When pg_rewind finished successfully, we see that the WAL segments
with .ready files are removed, because they were already absent on the
promoted replica. We end up in a situation where we completely lose some
WAL segments, even though we had a clear sign that they were not
archived and
more importantly, pg_rewind read these segments while collecting
information about the data blocks.
In terms of syncing the old primary to the new primary, no data has
been lost. The "lost" segments are anyway unusable for the new primary
since they are no longer compatible with it. How did you intend to use
the WAL files with the incompatible cluster?
9. The old primary fails to start because of the missing WAL segments
(more strictly, the records between the last common checkpoint and the
point of divergence) with the following log record: "ERROR: requested WAL
segment 000000020004D20200000096 has already been removed"
That means that the tail end of the rewound old primary has been lost
on the new primary's pg_wal. In that case, you need to somehow
copy in the archived WAL files on the new primary. You can just do
that, or you can set up restore_command properly.
Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms
of creating filemap for WALs? Can it preserve the WAL segments that contain
those potentially lost records (> the last common checkpoint and < the
point of divergence) on the target? (see the patch attached)
Since they are not really needed once rewind completes.
If I am missing something however, please correct me or explain why it is
not possible to implement this straightforward solution.
Maybe you're mistaking the operation. If I understand the situation
correctly, I think the following steps replay your "issue" and then
resolve it.
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'always'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
pg_ctl -D newprim promote
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D oldprim stop
echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
# pg_rewind -D oldprim --source-server='port=5433' # fails
pg_rewind -D oldprim --source-server='port=5433' -c
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
postgres -D oldprim
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000003 has already been removed
[ctrl-C]
======
Now the old primary requires older WAL files *on the new
primary*. Here, define restore_command to do that.
=====
echo "restore_command='cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
postgres -D oldprim
=====
Now the old primary runs as a standby of the new primary.
LOG: restored log file "000000020000000000000006" from archive
LOG: consistent recovery state reached at 0/30020B0
LOG: database system is ready to accept read-only connections
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Kyotaro,
On Thu, 25 Aug 2022 at 09:49, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
In the first place, this is not a bug. (At least doesn't seem.)
If you mean to propose behavioral changes, -hackers is the place.
Well, maybe... We can always change it.
8. When pg_rewind finishes successfully, we see that the WAL segments
with .ready files are removed, because they were already absent on the
promoted replica. We end up in a situation where we completely lose some
WAL segments, even though we had a clear sign that they were not
archived, and,
more importantly, pg_rewind read these segments while collecting
information about the data blocks.
In terms of syncing the old primary to the new primary, no data has
been lost. The "lost" segments are anyway unusable for the new primary
since they are no longer compatible with it. How did you intend to use
the WAL files with the incompatible cluster?
These files are required for the old primary to start as a replica.
9. The old primary fails to start because of the missing WAL segments
(more strictly, the records between the last common checkpoint and the
point of divergence) with the following log record: "ERROR: requested WAL
segment 000000020004D20200000096 has already been removed"
That means that the tail end of the rewound old primary has been lost
on the new primary's pg_wal.
Correct. The old primary was down for about 20m and we have
checkpoint_timeout = 5m, so the new primary already recycled them.
In that case, you need to somehow
copy-in the archived WAL files on the new primary. You can just do
that or you can set up restore_command properly.
These files never made it to the archive because the server crashed. The
only place where they existed was pg_wal in the old primary.
Thus, my thoughts are: why can't pg_rewind be a little bit wiser in terms
of creating filemap for WALs? Can it preserve the WAL segments that contain
those potentially lost records (> the last common checkpoint and < the
point of divergence) on the target? (see the patch attached)
Since they are not really needed once rewind completes.
pg_rewind creates the backup_label file with START WAL LOCATION and
CHECKPOINT LOCATION that point to the last common checkpoint.
Removed files are between the last common checkpoint and diverged WAL
location, and therefore are required for Postgres to do successful recovery.
Since these files never made it to the archive and are also absent on the
new primary, the old primary can't start as a replica.
And I will emphasize one more time that these files were removed by
pg_rewind despite the known fact that they are required to perform a
recovery.
If I am missing something however, please correct me or explain why it is
not possible to implement this straightforward solution.
Maybe you're mistaking the operation.
We are not (Patroni author is here).
If I understand the situation
correctly, I think the following steps replay your "issue" and then
resolve it.
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'always'">> oldprim/postgresql.conf
With archive_mode = always you can't reproduce it.
It is very rare that people set it to always in production, due to the overhead.
echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
pg_ctl -D newprim promote
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select
pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D oldprim stop
The archive_mode has to be set to on and the archive_command should be
failing when you do pg_ctl -D oldprim stop
echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
# pg_rewind -D oldprim --source-server='port=5433' # fails
pg_rewind -D oldprim --source-server='port=5433' -c
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select
pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
postgres -D oldprim
FATAL: could not receive data from WAL stream: ERROR: requested WAL
segment 000000020000000000000003 has already been removed
Regards,
--
Alexander Kukushkin
(Moved to -hackers)
At Thu, 25 Aug 2022 10:34:40 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'always'">> oldprim/postgresql.conf
With archive_mode = always you can't reproduce it.
It is very rare that people set it to always in production, due to the overhead.
...
The archive_mode has to be set to on and the archive_command should be
failing when you do pg_ctl -D oldprim stop
Ah, I see.
What I still don't understand is why pg_rewind doesn't work for the
old primary in that case. When archive_mode=on, the old primary has
the complete set of WAL files, counting both pg_wal and its archive. So,
the same as in the previous repro, pg_rewind -c ought to work (but it
uses its own archive this time). In that sense the proposed solution
is still not needed in this case.
A bit harder situation comes after the server is successfully rewound: if
the new primary goes so far that the old primary cannot connect. Even
in that case, you can copy in the required WAL files or configure
restore_command of the old primary so that it finds the required WAL files
there.
As a result, the system in total doesn't lose a WAL file.
So.. I might still be missing something..
###############################
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
# the last common checkpoint
psql -p 5432 -c 'checkpoint'
# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote
psql -p 5433 -c 'checkpoint'
# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop
# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c
# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
postgres -D oldprim # fails with "WAL file has been removed"
# The alternative of copying-in
# echo "restore_command = 'cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)
postgres -D oldprim
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Kyotaro,
On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
With archive_mode = always you can't reproduce it.
It is very rare that people set it to always in production, due to the overhead.
...
The archive_mode has to be set to on and the archive_command should be
failing when you do pg_ctl -D oldprim stop
Ah, I see.
What I still don't understand is why pg_rewind doesn't work for the
old primary in that case. When archive_mode=on, the old primary has
the complete set of WAL files, counting both pg_wal and its archive. So,
the same as in the previous repro, pg_rewind -c ought to work (but it
uses its own archive this time). In that sense the proposed solution
is still not needed in this case.
The pg_rewind finishes successfully. But as a result it removes some files
from pg_wal that are required to perform recovery because they are missing
on the new primary.
A bit harder situation comes after the server is successfully rewound: if
the new primary goes so far that the old primary cannot connect. Even
in that case, you can copy in the required WAL files or configure
restore_command of the old primary so that it finds the required WAL files
there.
Yes, we can back up pg_wal before running pg_rewind, but it feels
very ugly, because we will also have to clean up this "backup" after a
successful recovery.
It would be much better if pg_rewind didn't remove WAL files between the
last common checkpoint and diverged LSN in the first place.
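For reference, the backup workaround being called ugly above can be sketched roughly as follows; the function names and the copy-back logic are illustrative assumptions, not something the thread prescribes:

```shell
# Sketch of the pg_wal backup workaround: snapshot pg_wal before pg_rewind,
# then put back any segment the rewind removed.  Names are hypothetical.
backup_wal() {                 # usage: backup_wal <pgdata> <backupdir>
    cp -a "$1/pg_wal" "$2"
}
restore_removed_wal() {        # usage: restore_removed_wal <pgdata> <backupdir>
    local pgdata=$1 backup=$2 f b
    for f in "$backup"/*; do
        [ -f "$f" ] || continue
        b=$(basename "$f")
        if [ ! -e "$pgdata/pg_wal/$b" ]; then
            cp "$f" "$pgdata/pg_wal/" && echo "restored $b"
        fi
    done
}
```

And, as noted, the backup directory still has to be cleaned up once the node has recovered, which is exactly the awkwardness being complained about.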
Regards,
--
Alexander Kukushkin
Hello, Alex.
At Fri, 26 Aug 2022 10:57:25 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
On Fri, 26 Aug 2022 at 10:04, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
What I still don't understand is why pg_rewind doesn't work for the
old primary in that case. When archive_mode=on, the old primary has
the complete set of WAL files, counting both pg_wal and its archive. So,
the same as in the previous repro, pg_rewind -c ought to work (but it
uses its own archive this time). In that sense the proposed solution
is still not needed in this case.
The pg_rewind finishes successfully. But as a result it removes some files
from pg_wal that are required to perform recovery because they are missing
on the new primary.
AFAICS pg_rewind doesn't. The -c option contrarily restores all the
segments after the last (common) checkpoint and all of them are left
alone after pg_rewind finishes. postgres itself removes the WAL files
after recovery. After-promotion cleanup and checkpoints remove the
files on the previous timeline.
Before pg_rewind runs in the repro below, the old primary has the
following segments.
TLI1: 2 8 9 A B C D
Just after pg_rewind finishes, the old primary has the following
segments.
TLI1: 2 3 5 6 7
TLI2: 4 (and 00000002.history)
pg_rewind copied 1-2 to 1-3 and 2-4 and history file from the new
primary, 1-4 to 1-7 from archive. After rewind finished, 1-4,1-8 to
1-D have been removed since the new primary didn't have them.
Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
restartpoint. All of the segments are useless after the old primary
promotes.
When the old primary starts, it uses 1-3 and 2-4 for recovery and
fails to fetch 2-5 from the new primary. But it is not an issue of
pg_rewind at all.
A bit harder situation comes after the server is successfully rewound: if
the new primary goes so far that the old primary cannot connect. Even
in that case, you can copy in the required WAL files or configure
restore_command of the old primary so that it finds the required WAL files
there.
Yes, we can back up pg_wal before running pg_rewind, but it feels
So, if I understand you correctly, the issue you are complaining about is
not the WAL segments on the old timeline but those on the
new timeline, which have nothing to do with what pg_rewind does. As
in the case of pg_basebackup, the missing segments need to
be somehow copied from the new primary, since the old primary never had
the chance to have them before.
very ugly, because we will also have to clean up this "backup" after a
successful recovery.
What do you mean by the "backup" here? Concretely, what WAL segments do
you feel need to be removed, for example, in the repro case? Or could
you show your issue by something like the repro below?
It would be much better if pg_rewind didn't remove WAL files between the
last common checkpoint and diverged LSN in the first place.
Thus I don't follow this..
regards.
(Fixed a bug and slightly modified)
====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
# the last common checkpoint
psql -p 5432 -c 'checkpoint'
# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'
pg_ctl -D newprim promote
# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'
# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop
# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c
# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
postgres -D oldprim # fails with "WAL file has been removed"
# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)
postgres -D oldprim
====
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Kyotaro,
On Tue, 30 Aug 2022 at 07:50, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
So, if I understand you correctly, the issue you are complaining about is
not the WAL segments on the old timeline but those on the
new timeline, which have nothing to do with what pg_rewind does. As
in the case of pg_basebackup, the missing segments need to
be somehow copied from the new primary, since the old primary never had
the chance to have them before.
No, we are complaining exactly about WAL segments from the old timeline
that are removed by pg_rewind.
Those segments haven't been archived by the old primary and the new primary
already recycled them.
Thus I don't follow this..
I did a slight modification of your script that reproduces a problem.
====
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
# the last common checkpoint
psql -p 5432 -c 'checkpoint'
# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
# advance WAL on the old primary; four WAL segments will never make it to the archive
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
# record approx. diverging WAL segment
start_wal=`psql -p 5432 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
pg_ctl -D newprim promote
# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'
pg_ctl -D oldprim stop
# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c
# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
postgres -D oldprim # fails with "WAL file has been removed"
# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)
postgres -D oldprim # also fails with "requested WAL segment XXX has already been removed"
====
Regards,
--
Alexander Kukushkin
I did a slight modification of your script that reproduces a problem.
====
It seems that formatting damaged the script, so I better attach it as a
file.
Regards,
--
Alexander Kukushkin
Attachments:
At Tue, 30 Aug 2022 14:50:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
AFAICS pg_rewind doesn't. The -c option contrarily restores all the
segments after the last (common) checkpoint and all of them are left
alone after pg_rewind finishes. postgres itself removes the WAL files
after recovery. After-promotion cleanup and checkpoints remove the
files on the previous timeline.
Before pg_rewind runs in the repro below, the old primary has the
following segments.
TLI1: 2 8 9 A B C D
Just after pg_rewind finishes, the old primary has the following
segments.
TLI1: 2 3 5 6 7
TLI2: 4 (and 00000002.history)
pg_rewind copied 1-2 to 1-3 and 2-4 and history file from the new
primary, 1-4 to 1-7 from archive. After rewind finished, 1-4, 1-8 to
1-D have been removed since the new primary didn't have them.
Recovery starts from 1-3 and promotes at 0/4_000000. postgres removes
1-5 to 1-7 by post-promotion cleanup and removes 1-2 to 1-4 by a
restartpoint. All of the segments are useless after the old primary
promotes.
When the old primary starts, it uses 1-3 and 2-4 for recovery and
fails to fetch 2-5 from the new primary. But it is not an issue of
pg_rewind at all.
Ah. I think I understand what you are mentioning. If the new primary
didn't have the segments 1-3 to 1-6, pg_rewind removes them. The new
primary doesn't have them in pg_wal nor in its archive. The old primary
has them in its archive. So to get out of the situation, we need to do
the following *two* things before the old primary can start:
1. copy 1-3 to 1-6 from the archive of the *old* primary
2. copy 2-7 and later from the archive of the *new* primary
Since pg_rewind had copied them into the old primary's pg_wal, removing
them just makes users perform the task twice, as you stated.
Okay, I completely understand the problem and am convinced that it is
worth changing the behavior.
However, the proposed patch looks too complex to me. It can be done
by just comparing the xlog file name against the last checkpoint location
and TLI in decide_file_actions().
regards.
=====
# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
# the last common checkpoint
psql -p 5432 -c 'checkpoint'
# record approx. diverging WAL segment
start_wal=`psql -p 5433 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
for i in $(seq 1 5); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D newprim promote
# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'
# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
pg_ctl -D oldprim stop
# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c
# advance WAL on the old primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
#### copy the missing file of the old timeline
## cp oldarch/00000001000000000000000[3456] oldprim/pg_wal
## cp newarch/00000002000000000000000* oldprim/pg_wal
postgres -D oldprim # fails with "WAL file has been removed"
# The alternative of copying-in
# echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
# copy-in WAL files from new primary's archive to old primary
(cd newarch;
for f in `ls`; do
if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
done)
postgres -D oldprim
=====
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 30 Aug 2022 08:49:27 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
No, we are complaining exactly about WAL segments from the old timeline
that are removed by pg_rewind.
Those segments haven't been archived by the old primary and the new primary
already recycled them.
Yeah, sorry for my thick skull but I finally got your point.
And as I said in the mail I sent just before, the patch looks too
complex. How about just comparing the WAL file name against the last
common checkpoint's TLI and LSN? We can tell filemap.c about the last
checkpoint, and decide_file_action() can compare the file name with it.
It is sufficient to preserve WAL files if the TLI matches and the segment
number of the WAL file is equal to or later than the checkpoint
location.
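For illustration only, the comparison being proposed (which pg_rewind would implement in C, inside decide_file_action()) can be sketched in shell; the helper names are assumptions, and this assumes the default 16MB wal_segment_size, i.e. 256 segments per xlogid:

```shell
# Sketch of "preserve if TLI matches and segno >= checkpoint segno".
# WAL file names are TTTTTTTTXXXXXXXXSSSSSSSS: 8 hex digits of timeline,
# 8 of xlogid, 8 of segment-within-xlogid (256 per xlogid at 16MB segments).
wal_tli()   { echo $((16#${1:0:8})); }
wal_segno() { echo $(( 16#${1:8:8} * 256 + 16#${1:16:8} )); }

keep_wal_file() {   # usage: keep_wal_file <filename> <ckpt_tli> <ckpt_segno>
    [ "$(wal_tli "$1")" -eq "$2" ] && [ "$(wal_segno "$1")" -ge "$3" ]
}
```

With the refinements discussed below in the thread, the TLI test would be dropped and an upper bound at the divergence segment added.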
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Kyotaro,
On Tue, 30 Aug 2022 at 09:51, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
And as I said in the mail I sent just before, the patch looks too
complex. How about just comparing the WAL file name against the last
common checkpoint's TLI and LSN? We can tell filemap.c about the last
checkpoint, and decide_file_action() can compare the file name with it.
It is sufficient to preserve WAL files if the TLI matches and the segment
number of the WAL file is equal to or later than the checkpoint
location.
What if the last common checkpoint was on a previous timeline?
I.e., standby was promoted to primary, the timeline changed from 1 to 2,
and after that the node crashed _before_ the CHECKPOINT after promote has
finished.
The next node will advance the timeline from 2 to 3.
In this case, the last common checkpoint will be on timeline 1, and the
check becomes more complex because we will have to consider both timelines,
1 and 2.
Also, we need to take into account the divergence LSN. Files after it are
not required.
Regards,
--
Alexander Kukushkin
At Tue, 30 Aug 2022 10:03:07 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
On Tue, 30 Aug 2022 at 09:51, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
And as I said in the mail I sent just before, the patch looks too
complex. How about just comparing the WAL file name against the last
common checkpoint's TLI and LSN? We can tell filemap.c about the last
checkpoint, and decide_file_action() can compare the file name with it.
It is sufficient to preserve WAL files if the TLI matches and the segment
number of the WAL file is equal to or later than the checkpoint
location.
What if the last common checkpoint was on a previous timeline?
I.e., standby was promoted to primary, the timeline changed from 1 to 2,
and after that the node crashed _before_ the CHECKPOINT after promote has
finished.
The next node will advance the timeline from 2 to 3.
In this case, the last common checkpoint will be on timeline 1, and the
check becomes more complex because we will have to consider both timelines,
1 and 2.
Hmm. Doesn't it work to just ignore the TLI then? All segments whose
segment number is equal to or larger than the checkpoint location are
preserved, regardless of TLI?
Also, we need to take into account the divergence LSN. Files after it are
not required.
They are removed at the later checkpoints. But we can also remove
segments that are out of the range between the last common checkpoint
and the divergence point, ignoring TLI. So the divergence point is also
compared?
if (file_segno >= last_common_checkpoint_seg &&
    file_segno <= divergence_seg)
    <PRESERVE IT>;
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, 30 Aug 2022 at 10:27, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
Hmm. Doesn't it work to just ignore the TLI then? All segments whose
segment number is equal to or larger than the checkpoint location are
preserved, regardless of TLI?
If we ignore TLI there is a chance that we may retain some unnecessary (or
just wrong) files.
Also, we need to take into account the divergence LSN. Files after it are
not required.
They are removed at the later checkpoints. But we can also remove
segments that are out of the range between the last common checkpoint
and the divergence point, ignoring TLI.
Everything that is newer than last_common_checkpoint_seg could be removed (but
it already happens automatically, because these files are missing on the
new primary).
WAL files that are older than last_common_checkpoint_seg could be either
removed or at least not copied from the new primary.
the divergence point is also compared?
if (file_segno >= last_common_checkpoint_seg &&
    file_segno <= divergence_seg)
    <PRESERVE IT>;
The current implementation relies on tracking WAL files being open while
searching for the last common checkpoint. It automatically starts from the
divergence_seg, automatically finishes at last_common_checkpoint_seg, and
last but not least, automatically handles timeline changes. I don't think
that manually written code that decides what to do from the WAL file name
(and also takes into account TLI) could be much simpler than the current
approach.
Actually, since we start doing some additional "manipulations" with files
in pg_wal, we probably should do a symmetric action with the files inside
pg_wal/archive_status.
Regards,
--
Alexander Kukushkin
At Tue, 30 Aug 2022 11:01:58 +0200, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
On Tue, 30 Aug 2022 at 10:27, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
Hmm. Doesn't it work to just ignore the TLI then? All segments whose
segment number is equal to or larger than the checkpoint location are
preserved, regardless of TLI?
If we ignore TLI there is a chance that we may retain some unnecessary (or
just wrong) files.
Right. I mean I don't think that's a problem, and we can rely on
Postgres itself for later cleanup. Theoretically some out-of-range TLI
or segno files are left alone, but they will surely be gone soon after
the server starts.
Also, we need to take into account the divergence LSN. Files after it are
not required.
They are removed at the later checkpoints. But we can also remove
segments that are out of the range between the last common checkpoint
and the divergence point, ignoring TLI.
Everything that is newer than last_common_checkpoint_seg could be removed (but
it already happens automatically, because these files are missing on the
new primary).
WAL files that are older than last_common_checkpoint_seg could be either
removed or at least not copied from the new primary.
...
The current implementation relies on tracking WAL files being open while
searching for the last common checkpoint. It automatically starts from the
divergence_seg, automatically finishes at last_common_checkpoint_seg, and
last but not least, automatically handles timeline changes. I don't think
that manually written code that decides what to do from the WAL file name
(and also takes into account TLI) could be much simpler than the current
approach.
Yeah, I know. My expectation is to take the simplest way to the same
effect. My concern was the additional hash. On second thought, I
concluded that we should do that on the existing filehash.
We can just add a FILE_ACTION_NONE entry to the file hash from
SimpleXLogPageRead. Since this happens before the decide_file_action()
call, decide_file_action() should ignore the entries with
FILE_ACTION_NONE. Also we need to call filehash_init() earlier.
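A minimal mock of that decision flow might look like the following. The types and the decision rule are simplified stand-ins for pg_rewind's filemap.c (the FILE_ACTION_* names follow the real enum, but the struct layout, the UNDECIDED state, and decide_all() are illustrative only):

```c
#include <stddef.h>

/* Simplified stand-in for pg_rewind's file map entry. */
typedef enum
{
	FILE_ACTION_UNDECIDED,		/* no action assigned yet */
	FILE_ACTION_NONE,			/* pre-assigned: leave the file alone */
	FILE_ACTION_COPY,
	FILE_ACTION_REMOVE
} file_action_t;

typedef struct
{
	const char *path;
	int			exists_on_source;
	file_action_t action;
} file_entry_t;

/* Mock decision: remove target-only files, copy the rest. */
static file_action_t
decide_file_action(file_entry_t *entry)
{
	return entry->exists_on_source ? FILE_ACTION_COPY : FILE_ACTION_REMOVE;
}

/*
 * The proposed loop: only decide entries still UNDECIDED, so WAL
 * segments marked FILE_ACTION_NONE from SimpleXLogPageRead() survive
 * even when they are absent on the source.
 */
static void
decide_all(file_entry_t *entries, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (entries[i].action == FILE_ACTION_UNDECIDED)
			entries[i].action = decide_file_action(&entries[i]);
}
```

The point of the mock: a WAL segment that was read while finding the checkpoint keeps FILE_ACTION_NONE instead of being downgraded to REMOVE just because the source already recycled it.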
Actually, since we start doing some additional "manipulations" with files
in pg_wal, we probably should do a symmetric action with files inside
pg_wal/archive_status
In that sense, pg_rewind rather should place the missing
archive_status/*.done files for segments, including restored ones, seen while
finding the checkpoint. This is analogous to the behavior of
pg_basebackup and pg_receivewal. Also we should add FILE_ACTION_NONE
entries for the .done files of segments read while finding the checkpoint.
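A rough sketch of creating such a missing .done marker (the pg_wal/archive_status layout and the empty-file convention are real; this helper, its name, and its error handling are purely illustrative):

```c
#include <stdio.h>

/*
 * Hypothetical helper: mark a WAL segment as already archived by
 * creating an empty archive_status/<segname>.done file, similar to
 * what pg_receivewal does for completed segments.
 */
static int
mark_segment_done(const char *pg_wal_dir, const char *segname)
{
	char		path[1024];
	FILE	   *f;

	snprintf(path, sizeof(path), "%s/archive_status/%s.done",
			 pg_wal_dir, segname);
	f = fopen(path, "w");		/* an empty file is sufficient */
	if (f == NULL)
		return -1;
	fclose(f);
	return 0;
}
```

The real implementation would presumably also fsync the file and directory, and record the action in the file map rather than touching the filesystem directly.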
What do you think about that?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 31 Aug 2022 14:30:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
What do you think about that?
By the way, don't you want to add a CF entry for this?
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Kayotaro,
Here is the new version of the patch that includes the changes you
suggested. It is smaller now but I doubt if it is as easy to understand as
it used to be.
The need for manipulations with the target's pg_wal/archive_status directory
is a question to discuss…
At first glance it seems to be useless for .ready files: the checkpointer
process will recreate them anyway if archiving is enabled on the rewound
old primary, and we will finally have them in the archive. As for the .done
files, it seems reasonable to follow the pg_basebackup logic and keep .done
files together with the corresponding segments (those between the last
common checkpoint and the point of divergence) to protect them from being
archived once again.
But on the other hand it seems to be not that straightforward: imagine we
have WAL segment X on the target along with an X.done file and we decide to
preserve them both (or we download it from the archive and force .done file
creation), while archive_mode was set to 'always' and the source (promoted
replica) also still has WAL segment X and an X.ready file. After pg_rewind we
will end up with both X.ready and X.done, which does not seem to be a good
situation (though most likely not critical either).
Regards,
Polina Bungina
On Wed, Aug 31, 2022 at 7:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
Attachment: v2-0001-pg_rewind-wal-deletion.patch
Terribly sorry for misspelling your name and for the top-posting!
Regards,
Polina Bungina
Hello Kyotaro,
any further thoughts on it?
Regards,
--
Alexander Kukushkin
At Thu, 1 Sep 2022 13:33:09 +0200, Polina Bungina <bungina@gmail.com> wrote in
Here is the new version of the patch that includes the changes you
suggested. It is smaller now but I doubt if it is as easy to understand as
it used to be.
pg_rewind works in two steps. First it constructs a file map which
decides the action for each file; second, it performs file
operations according to the file map. So, if we are going to do
something with some files, that action should be recorded in the file
map, I think.
Regarding the patch, pg_rewind starts reading segments from the
divergence point back to the nearest checkpoint, then moves forward
during rewinding. So, the fact that SimpleXLogPageRead has read a
segment suggests that the segment is required during the next startup.
So I don't think we need to move around the keepWalSeg flag. All
files that are wanted while rewinding should be preserved
unconditionally.
It's annoying that the file path for the file map and for open(2) have
different top directories. But sharing the same path string between the
two seems rather ugly…
I feel uncomfortable directly touching the internals of file_entry_t
outside filemap.c. I'd like to hide the internals in filemap.c, but
pg_rewind already does that…
+ /*
+ * Some entries (WAL segments) already have an action assigned
+ * (see SimpleXLogPageRead()).
+ */
+ if (entry->action == FILE_ACTION_NONE)
+ continue;
entry->action = decide_file_action(entry);
It might be more reasonable to call decide_file_action() when action
is UNDECIDED.
The need for manipulations with the target's pg_wal/archive_status directory
is a question to discuss…
At first glance it seems to be useless for .ready files: the checkpointer
process will recreate them anyway if archiving is enabled on the rewound
old primary, and we will finally have them in the archive. As for the .done
files, it seems reasonable to follow the pg_basebackup logic and keep .done
files together with the corresponding segments (those between the last
common checkpoint and the point of divergence) to protect them from being
archived once again.
But on the other hand it seems to be not that straightforward: imagine we
have WAL segment X on the target along with an X.done file and we decide to
preserve them both (or we download it from the archive and force .done file
creation), while archive_mode was set to 'always' and the source (promoted
replica) also still has WAL segment X and an X.ready file. After pg_rewind we
will end up with both X.ready and X.done, which does not seem to be a good
situation (though most likely not critical either).
Thanks for the thought. Yes, it's not so straightforward. And, as you
mentioned, the worst result of not doing that is that some
already-archived segments are archived again at the next run, which is
generally harmless. So I think we're ok to ignore that in this patch and
create another patch if we still want to do that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Sep 27, 2022 at 9:50 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
Regarding the patch, pg_rewind starts reading segments from the
divergence point back to the nearest checkpoint, then moves forward
during rewinding. So, the fact that SimpleXLogPageRead has read a
segment suggests that the segment is required during the next startup.
So I don't think we need to move around the keepWalSeg flag. All
files that are wanted while rewinding should be preserved
unconditionally.
I am probably not getting this right, but as far as I see SimpleXLogPageRead
is called at most 3 times during a pg_rewind run:
1. From readOneRecord, to determine the end-of-WAL on the target by reading
the last shutdown checkpoint record/minRecoveryPoint on it
2. From findLastCheckpoint, to find the last common checkpoint (here it
indeed reads all the segments that are required during the startup, hence
the keepWalSeg flag set to true)
3. From extractPageMap, to extract all the pages modified after the fork
(here we also read all the segments that should be kept, but also the ones
further on, until the target's end record. It doesn't seem we should
unconditionally preserve them all).
Am I missing something?
+ /*
+  * Some entries (WAL segments) already have an action assigned
+  * (see SimpleXLogPageRead()).
+  */
+ if (entry->action == FILE_ACTION_NONE)
+     continue;
  entry->action = decide_file_action(entry);
It might be more reasonable to call decide_file_action() when action
is UNDECIDED.
Agree, will change this part.
Regards,
Polina Bungina