Making pg_rewind faster
Hi Hackers,
I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
Thanks,
Vignesh
Import Notes
Reply to msg id not found:
Hi everyone!
Here's the attached patch submission to optimize pg_rewind performance when many WAL files are retained on server. This patch avoids replaying (copying over) older WAL segment files that fall before the point of divergence between the source and target servers.
Thanks,
Justin
________________________________
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Looping in my other email.
On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
Hi Hackers,
I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
Thanks,
Vignesh
Attachments:
v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 653ac0c2c2e4ba1b5b9069dff3d4bb618de58141 Mon Sep 17 00:00:00 2001
From: Justin Kwan <jkwan@cloudflare.com>
Date: Wed, 29 Jun 2022 14:12:50 -0700
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
"Optimize pg_rewind to Skip Copying Common WAL Files". It adds a conditional
check to avoid uncessesarily copying any WAL segment files from source to
target if they are common between both servers, before the point of WAL
divergence during pg_rewind. On the source server, each WAL file's
corresponding segment number is computed and compared against the segement
number of the first diverged LSN. All WAL files which fall before the segment
of the first diverged LSN can safely be skipped from copying to the target.
The reduction in WAL segment files transmitted over the network from source to
target server massively reduces overall pg_rewind execution time, when a large
amount of WAL segment files are retained.
This patch is intended for immediate application into the Postgres master
branch.
Regression tests are included to verify that WAL segment files prior to WAL
diverge are not copied. The source code successfully compiles, and all tests
successfully pass.
Author: Justin Kwan, Vignesh Ravichandran
---
src/bin/pg_rewind/filemap.c | 25 +++++--
src/bin/pg_rewind/pg_rewind.c | 38 +++++++++-
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/001_basic.pl | 63 ++++++++++++++++
src/bin/pg_rewind/t/008_min_recovery_point.pl | 72 +++++++++++++++++++
5 files changed, 193 insertions(+), 7 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 6252931041..1d3f52ec53 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -49,6 +49,7 @@ static uint32 hash_string_pointer(const char *s);
#define SH_RAW_ALLOCATOR pg_malloc0
#define SH_DECLARE
#define SH_DEFINE
+#include "access/xlog_internal.h"
#include "lib/simplehash.h"
#define FILEHASH_INITIAL_SIZE 1000
@@ -728,11 +729,25 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (!entry->isrelfile)
{
- /*
- * It's a non-data file that we have no special processing
- * for. Copy it in toto.
- */
- return FILE_ACTION_COPY;
+ /* Handle WAL segment file. */
+ const char *fname;
+ char *slash;
+
+ /* Split filepath into directory & filename. */
+ slash = strrchr(path, '/');
+ if (slash)
+ fname = slash + 1;
+ else
+ fname = path;
+
+ if (IsXLogFileName(fname))
+ return decide_wal_file_action(fname);
+
+ /*
+ * It's a non-data file that we have no special processing
+ * for. Copy it in toto.
+ */
+ return FILE_ACTION_COPY;
}
else
{
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..b181de07d3 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -25,7 +25,6 @@
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "file_ops.h"
-#include "filemap.h"
#include "getopt_long.h"
#include "pg_rewind.h"
#include "rewind_source.h"
@@ -69,6 +68,8 @@ bool dry_run = false;
bool do_sync = true;
bool restore_wal = false;
+static XLogRecPtr divergerec;
+
/* Target history */
TimeLineHistoryEntry *targetHistory;
int targetNentries;
@@ -129,7 +130,6 @@ main(int argc, char **argv)
};
int option_index;
int c;
- XLogRecPtr divergerec;
int lastcommontliIndex;
XLogRecPtr chkptrec;
TimeLineID chkpttli;
@@ -490,6 +490,40 @@ main(int argc, char **argv)
return 0;
}
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ XLogSegNo last_common_segno;
+
+ /*
+ * Find last common WAL segment number between source and target before
+ * divergence given last common LSN (byte position).
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
+ /* Get current WAL segment number given current segment file name. */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid unnecessarily copying WAL segment files created before last common
+ * segment to avoid performance penalty when many WAL segment files are
+ * retained on source and copied to target.
+ *
+ * These files are already common between new source (old target) and new
+ * target (old source). Only WAL segment files after the last common segment
+ * number on the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno) {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Perform the rewind.
*
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 393182fe2a..aed95c3700 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "datapagemap.h"
+#include "filemap.h"
#include "libpq-fe.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
@@ -44,6 +45,7 @@ extern void findLastCheckpoint(const char *datadir, XLogRecPtr searchptr,
const char *restoreCommand);
extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
int tliIndex, const char *restoreCommand);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in pg_rewind.c */
extern void progress_report(bool finished);
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index db9201f38e..a64672ff38 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -78,6 +78,31 @@ sub run_test
"insert into drop_tbl values ('in primary, after promotion')");
primary_psql("DROP TABLE drop_tbl");
+ # Record last modification time of WAL segment file common between both
+ # source and target, stored on current primary server. pg_rewind should
+ # refrain from overwriting this file with the source's newly forked copy.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_modified_at);
+
+ # Record last modification time of WAL segment file that is partially
+ # written to just before the new timeline, but still common between both
+ # source and target. pg_rewind should overwrite this file with the source's
+ # copy to be safe.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_modified_at);
+
+ # Record last modification time of first diverged WAL segment file on the
+ # old primary's timeline. pg_rewind should overwrite this file with the
+ # source's newly forked copy.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_modified_at);
+
# Before running pg_rewind, do a couple of extra tests with several
# option combinations. As the code paths taken by those tests
# do not change for the "local" and "remote" modes, just run them
@@ -172,6 +197,44 @@ in primary, before promotion
),
'drop');
+ if ($test_mode eq 'local' || $test_mode eq 'remote')
+ {
+ # Last modification time of WAL segment files common between both
+ # source and target is unchanged on the target. This indicates that
+ # pg_rewind skipped copying the source's files and overwriting the
+ # target's files.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_last_modified_at);
+
+ cmp_ok($common_wal_last_modified_at,
+ '==', $common_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+ # Last modification time of WAL segment files just before and after
+ # the new timeline should now be further ahead. (Both of these WAL
+ # files are internally represented by segment 3.) This indicates that
+ # pg_rewind copied the source's files and overwrote the target's files.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
+
+ cmp_ok($last_common_tli1_wal_last_modified_at,
+ '>=', $last_common_tli1_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, overwritten');
+
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_last_modified_at);
+
+ cmp_ok($last_common_tli2_wal_last_modified_at,
+ '>', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, overwritten');
+ }
+
# Permissions on PGDATA should be default
SKIP:
{
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e6a7177fb7..ae17fbdb8d 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -136,6 +136,32 @@ $node_1->safe_psql('postgres',
$node_2->poll_query_until('postgres',
q|SELECT COUNT(*) > 1 FROM public.bar|, 't');
+# Record last modification time of WAL segment file that is partially
+# written to just before the new timeline, but still common between both
+# source and target. pg_rewind should overwrite this file with the source's
+# copy to be safe.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_modified_at);
+
+# Record last modification time of WAL segment file that is partially
+# written to just before the new timeline, but still common between both
+# source and target. pg_rewind should overwrite this file with the source's
+# copy to be safe.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_modified_at);
+
+# Record last modification time of first diverged WAL segment file on the
+# old primary's timeline. pg_rewind should overwrite this file with the
+# source's newly forked copy.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_modified_at);
+
# At this point node_2 will shut down without a shutdown checkpoint,
# but with WAL entries beyond the preceding shutdown checkpoint.
$node_2->stop('fast');
@@ -174,4 +200,50 @@ and this too), 'table foo after rewind');
$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
is($result, qq(in both), 'table bar after rewind');
+# Last modification time of WAL segment files common between both source and
+# target are unchanged on the target. This indicates that pg_rewind skipped
+# copying the source's files and overwriting the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_last_modified_at);
+
+cmp_ok($common_wal_last_modified_at,
+ '==', $common_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+# Last modification time of WAL segment files just before and after the new
+# timelines should now be further ahead. (Both of these WAL files are
+# internally represented by segment 3.) This indicates that pg_rewind copied
+# the source's files and overwrote the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
+
+cmp_ok($last_common_tli1_wal_last_modified_at,
+ '>', $last_common_tli1_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_last_modified_at);
+
+cmp_ok($last_common_tli2_wal_last_modified_at,
+ '>', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, not overwritten');
+
+# Last modification time of WAL segment files in source but not target should
+# now be further ahead on the target. This indicates that pg_rewind copied the
+# source's files and overwrote the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000030000000000000003');",
+ stdout => \my $first_diverged_tli3_wal_last_modified_at);
+
+cmp_ok($first_diverged_tli3_wal_last_modified_at,
+ '>=', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, not overwritten');
+
done_testing();
--
2.35.3
Import Notes
Reply to msg id not found: CAAWA-0-Cn-hBPYpBXjF9NGT5wmrrBO953dkeGa+uKf5-NQMVNA@mail.gmail.com
Hi everyone!
I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
Thanks,
Justin
________________________________
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster
Looping in my other email.
On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
Hi Hackers,
I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
Thanks,
Vignesh
Attachments:
v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 745fc4bde84cae7e595a65a0e23c8476b0e3e323 Mon Sep 17 00:00:00 2001
From: Justin Kwan <jkwan@cloudflare.com>
Date: Fri, 15 Jul 2022 18:48:59 -0700
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
"Optimize pg_rewind to Skip Copying Common WAL Files". It adds a conditional
check to avoid uncessesarily copying any WAL segment files from source to
target if they are common between both servers, before the point of WAL
divergence during pg_rewind. On the source server, each WAL file's
corresponding segment number is computed and compared against the segement
number of the first diverged LSN. All WAL files which fall before the segment
of the first diverged LSN can safely be skipped from copying to the target.
The reduction in WAL segment files transmitted over the network from source to
target server massively reduces overall pg_rewind execution time, when a large
amount of WAL segment files are retained.
This patch is intended for immediate application into the Postgres master
branch on version 14.4.
Regression tests are included to verify that WAL segment files prior to WAL
diverge are not copied. The source code successfully compiles, and all tests
successfully pass.
Author: Justin Kwan, Vignesh Ravichandran
---
src/bin/pg_rewind/filemap.c | 25 +++++--
src/bin/pg_rewind/pg_rewind.c | 39 +++++++++-
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/001_basic.pl | 65 +++++++++++++++-
src/bin/pg_rewind/t/008_min_recovery_point.pl | 74 ++++++++++++++++++-
5 files changed, 196 insertions(+), 9 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2618b4c957..0df84bc115 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -26,6 +26,7 @@
#include <sys/stat.h>
#include <unistd.h>
+#include "access/xlog_internal.h"
#include "catalog/pg_tablespace_d.h"
#include "common/hashfn.h"
#include "common/string.h"
@@ -729,11 +730,25 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (!entry->isrelfile)
{
- /*
- * It's a non-data file that we have no special processing
- * for. Copy it in toto.
- */
- return FILE_ACTION_COPY;
+ /* Handle WAL segment file. */
+ const char *fname;
+ char *slash;
+
+ /* Split filepath into directory & filename. */
+ slash = strrchr(path, '/');
+ if (slash)
+ fname = slash + 1;
+ else
+ fname = path;
+
+ if (IsXLogFileName(fname))
+ return decide_wal_file_action(fname);
+
+ /*
+ * It's a non-data file that we have no special processing
+ * for. Copy it in toto.
+ */
+ return FILE_ACTION_COPY;
}
else
{
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 2ac4910778..591f6674d7 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -24,7 +24,6 @@
#include "common/string.h"
#include "fe_utils/recovery_gen.h"
#include "file_ops.h"
-#include "filemap.h"
#include "getopt_long.h"
#include "pg_rewind.h"
#include "rewind_source.h"
@@ -67,6 +66,8 @@ bool dry_run = false;
bool do_sync = true;
bool restore_wal = false;
+static XLogRecPtr divergerec;
+
/* Target history */
TimeLineHistoryEntry *targetHistory;
int targetNentries;
@@ -124,7 +125,6 @@ main(int argc, char **argv)
};
int option_index;
int c;
- XLogRecPtr divergerec;
int lastcommontliIndex;
XLogRecPtr chkptrec;
TimeLineID chkpttli;
@@ -483,6 +483,41 @@ main(int argc, char **argv)
return 0;
}
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ XLogSegNo last_common_segno;
+
+ /*
+ * Find last common WAL segment number between source and target before
+ * divergence given last common LSN (byte position).
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
+ /* Get current WAL segment number given current segment file name. */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid unnecessarily copying WAL segment files created before last common
+ * segment to avoid performance penalty when many WAL segment files are
+ * retained on source and copied to target.
+ *
+ * These files are already common between new source (old target) and new
+ * target (old source). Only WAL segment files after the last common segment
+ * number on the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Perform the rewind.
*
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index d38635a73d..c7c72d4b38 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "datapagemap.h"
+#include "filemap.h"
#include "libpq-fe.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
@@ -50,6 +51,7 @@ extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
/* in pg_rewind.c */
extern void progress_report(bool finished);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in timeline.c */
extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index d636f35f5e..f9a079357e 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -4,7 +4,7 @@
use strict;
use warnings;
use TestLib;
-use Test::More tests => 23;
+use Test::More tests => 29;
use FindBin;
use lib $FindBin::RealBin;
@@ -78,6 +78,31 @@ sub run_test
"insert into drop_tbl values ('in primary, after promotion')");
primary_psql("DROP TABLE drop_tbl");
+ # Record last modification time of WAL segment file common between both
+ # source and target, stored on current primary server. pg_rewind should
+ # refrain from overwriting this file with the source's newly forked copy.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_modified_at);
+
+ # Record last modification time of WAL segment file that is partially
+ # written to just before the new timeline, but still common between both
+ # source and target. pg_rewind should overwrite this file with the source's
+ # copy to be safe.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_modified_at);
+
+ # Record last modification time of first diverged WAL segment file on the
+ # old primary's timeline. pg_rewind should overwrite this file with the
+ # source's newly forked copy.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_modified_at);
+
# Before running pg_rewind, do a couple of extra tests with several
# option combinations. As the code paths taken by those tests
# do not change for the "local" and "remote" modes, just run them
@@ -172,6 +197,44 @@ in primary, before promotion
),
'drop');
+ if ($test_mode eq 'local' || $test_mode eq 'remote')
+ {
+ # Last modification time of WAL segment files common between both
+ # source and target is unchanged on the target. This indicates that
+ # pg_rewind skipped copying the source's files and overwriting the
+ # target's files.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_last_modified_at);
+
+ cmp_ok($common_wal_last_modified_at,
+ '==', $common_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+ # Last modification time of WAL segment files just before and after
+ # the new timeline should now be further ahead. (Both of these WAL
+ # files are internally represented by segment 3.) This indicates that
+ # pg_rewind copied the source's files and overwrote the target's files.
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
+
+ cmp_ok($last_common_tli1_wal_last_modified_at,
+ '>=', $last_common_tli1_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, overwritten');
+
+ $node_primary->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_last_modified_at);
+
+ cmp_ok($last_common_tli2_wal_last_modified_at,
+ '>=', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, overwritten');
+ }
+
# Permissions on PGDATA should be default
SKIP:
{
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 9ebcbad0d2..1fa3d0907f 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -34,7 +34,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 7;
use File::Copy;
@@ -138,6 +138,32 @@ $node_1->safe_psql('postgres',
$node_2->poll_query_until('postgres',
q|SELECT COUNT(*) > 1 FROM public.bar|, 't');
+# Record last modification time of WAL segment file that is partially
+# written to just before the new timeline, but still common between both
+# source and target. pg_rewind should overwrite this file with the source's
+# copy to be safe.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_modified_at);
+
+# Record last modification time of WAL segment file that is partially
+# written to just before the new timeline, but still common between both
+# source and target. pg_rewind should overwrite this file with the source's
+# copy to be safe.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_modified_at);
+
+# Record last modification time of first diverged WAL segment file on the
+# old primary's timeline. pg_rewind should overwrite this file with the
+# source's newly forked copy.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_modified_at);
+
# At this point node_2 will shut down without a shutdown checkpoint,
# but with WAL entries beyond the preceding shutdown checkpoint.
$node_2->stop('fast');
@@ -175,3 +201,49 @@ and this too), 'table foo after rewind');
$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
is($result, qq(in both), 'table bar after rewind');
+
+# Last modification time of WAL segment files common between both source and
+# target are unchanged on the target. This indicates that pg_rewind skipped
+# copying the source's files and overwriting the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000002');",
+ stdout => \my $common_wal_last_modified_at);
+
+cmp_ok($common_wal_last_modified_at,
+ '==', $common_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+# Last modification time of WAL segment files just before and after the new
+# timelines should now be further ahead. (Both of these WAL files are
+# internally represented by segment 3.) This indicates that pg_rewind copied
+# the source's files and overwrote the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
+
+cmp_ok($last_common_tli1_wal_last_modified_at,
+ '>=', $last_common_tli1_wal_modified_at,
+ 'common WAL segment file on target, before divergence, not overwritten');
+
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000020000000000000003');",
+ stdout => \my $last_common_tli2_wal_last_modified_at);
+
+cmp_ok($last_common_tli2_wal_last_modified_at,
+ '>=', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, not overwritten');
+
+# Last modification time of WAL segment files in source but not target should
+# now be further ahead on the target. This indicates that pg_rewind copied the
+# source's files and overwrote the target's files.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000030000000000000003');",
+ stdout => \my $first_diverged_tli3_wal_last_modified_at);
+
+cmp_ok($first_diverged_tli3_wal_last_modified_at,
+ '>=', $last_common_tli2_wal_modified_at,
+ 'last common WAL segment file on target, before divergence, not overwritten');
--
2.35.3
Import Notes
Reply to msg id not found: CAAWA-0-Cn-hBPYpBXjF9NGT5wmrrBO953dkeGa+uKf5-NQMVNA@mail.gmail.com
Justin Kwan <justinpkwan@outlook.com> writes:
I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
Hi Tom,
Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
Justin
________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Justin Kwan <justinpkwan@outlook.com> writes:
I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
Thank you for taking a look at this and that sounds good. I will
send over a patch compatible with Postgres v16.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments. A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael
Hi Michael,
Thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thanks,
Justin
________________________________
From: Michael Paquier
Sent: Tuesday, July 19, 2022 1:36 AM
To: Justin Kwan
Cc: Tom Lane; pgsql-hackers; vignesh; jkwan@cloudflare.com; vignesh ravichandran; hlinnaka@iki.fi
Subject: Re: Making pg_rewind faster
On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
Thank you for taking a look at this and that sounds good. I will
send over a patch compatible with Postgres v16.
+$node_2->psql(
+ 'postgres',
+ "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+ stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments. A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael
Attachments:
v2-pg16-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=v2-pg16-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From edb976cf1d2daedafef9cee1e36c4ef5f232e9aa Mon Sep 17 00:00:00 2001
From: Justin Kwan <jkwan@cloudflare.com>
Date: Tue, 19 Jul 2022 22:15:36 -0700
Subject: [PATCH 1/1] Avoid copying WAL segments before divergence to speed up
pg_rewind
"Optimize pg_rewind to Skip Copying Common WAL Files". It adds a conditional
check to avoid uncessesarily copying any WAL segment files from source to
target if they are common between both servers, before the point of WAL
divergence during pg_rewind. On the source server, each WAL file's
corresponding segment number is computed and compared against the segement
number of the first diverged LSN. All WAL files which fall before the segment
of the first diverged LSN can safely be skipped from copying to the target.
The reduction in WAL segment files transmitted over the network from source to
target server massively reduces overall pg_rewind execution time, when a large
amount of WAL segment files are retained.
This patch is intended for immediate application into the Postgres master
branch on version 14.4.
Regression tests are included to verify that WAL segment files prior to WAL
diverge are not copied. The source code successfully compiles, and all tests
successfully pass.
Author: Justin Kwan, Vignesh Ravichandran
---
src/bin/pg_rewind/filemap.c | 15 +++++
src/bin/pg_rewind/pg_rewind.c | 39 ++++++++++-
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/001_basic.pl | 65 +++++++++++++++++++
src/bin/pg_rewind/t/008_min_recovery_point.pl | 64 ++++++++++++++++++
5 files changed, 183 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 269ed6446e..2fd0523ce5 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -26,6 +26,7 @@
#include <sys/stat.h>
#include <unistd.h>
+#include "access/xlog_internal.h"
#include "catalog/pg_tablespace_d.h"
#include "common/hashfn.h"
#include "common/string.h"
@@ -728,6 +729,20 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (!entry->isrelfile)
{
+ /* Handle WAL segment file. */
+ const char *fname;
+ char *slash;
+
+ /* Split filepath into directory & filename. */
+ slash = strrchr(path, '/');
+ if (slash)
+ fname = slash + 1;
+ else
+ fname = path;
+
+ if (IsXLogFileName(fname))
+ return decide_wal_file_action(fname);
+
/*
* It's a non-data file that we have no special processing
* for. Copy it in toto.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..7bdc3447c2 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -25,7 +25,6 @@
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "file_ops.h"
-#include "filemap.h"
#include "getopt_long.h"
#include "pg_rewind.h"
#include "rewind_source.h"
@@ -69,6 +68,8 @@ bool dry_run = false;
bool do_sync = true;
bool restore_wal = false;
+static XLogRecPtr divergerec;
+
/* Target history */
TimeLineHistoryEntry *targetHistory;
int targetNentries;
@@ -129,7 +130,6 @@ main(int argc, char **argv)
};
int option_index;
int c;
- XLogRecPtr divergerec;
int lastcommontliIndex;
XLogRecPtr chkptrec;
TimeLineID chkpttli;
@@ -490,6 +490,41 @@ main(int argc, char **argv)
return 0;
}
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ XLogSegNo last_common_segno;
+
+ /*
+ * Find last common WAL segment number between source and target before
+ * divergence given last common LSN (byte position).
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
+ /* Get current WAL segment number given current segment file name. */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid unnecessarily copying WAL segment files created before last common
+ * segment to avoid performance penalty when many WAL segment files are
+ * retained on source and copied to target.
+ *
+ * These files are already common between new source (old target) and new
+ * target (old source). Only WAL segment files after the last common segment
+ * number on the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Perform the rewind.
*
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 8b4b50a33b..0752348831 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "datapagemap.h"
+#include "filemap.h"
#include "libpq-fe.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
@@ -47,6 +48,7 @@ extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
/* in pg_rewind.c */
extern void progress_report(bool finished);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in timeline.c */
extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index db9201f38e..7caff1c029 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -172,6 +172,71 @@ in primary, before promotion
),
'drop');
+ if ($test_mode eq 'archive')
+ {
+ goto SKIP;
+ }
+
+ # Read logs produced by pg_rewind
+ my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_001_basic";
+ my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+ };
+
+ # This WAL segment file is common between both source and target,
+ # originating before WAL timeline divergence. pg_rewind should not copy
+ # and overwrite this target file with the source's copy.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+ if ($test_mode eq 'remote')
+ {
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+ }
+
+ # These WAL segment files occur just before and after the divergence to the
+ # new timeline. (Both of these WAL files are internally represented by
+ # segment 3.) pg_rewind should copy and overwrite these target files with
+ # the source's copies.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file included in copying to target');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
# Permissions on PGDATA should be default
SKIP:
{
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e6a7177fb7..204ec10738 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -174,4 +174,68 @@ and this too), 'table foo after rewind');
$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
is($result, qq(in both), 'table bar after rewind');
+# Read logs produced by pg_rewind
+my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_008_min_recovery_point";
+my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+};
+
+# This WAL segment file is common between both source and target,
+# originating before WAL timeline divergence. pg_rewind should not copy
+# and overwrite this target file with the source's copy.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+
+# These WAL segment files occur just before and after the divergence to the
+# new timeline. (Both of these WAL files are internally represented by
+# segment 3.) pg_rewind should copy and overwrite these target files with
+# the source's copies.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000020000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000030000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000030000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
done_testing();
--
2.35.3
Import Notes
Resolved by subject fallback
Hi Michael,
Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thanks,
Justin
________________________________
From: Justin Kwan <justinpkwan@outlook.com>
Sent: July 18, 2022 1:14 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Hi Tom,
Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
Justin
________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster
Justin Kwan <justinpkwan@outlook.com> writes:
I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
It's very unlikely that we would consider committing such changes into
released branches. In fact, it's too late even for v15. You should
be submitting non-bug-fix patches against master (v16-to-be).
regards, tom lane
Attachments:
v2-pg16-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=v2-pg16-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From edb976cf1d2daedafef9cee1e36c4ef5f232e9aa Mon Sep 17 00:00:00 2001
From: Justin Kwan <jkwan@cloudflare.com>
Date: Tue, 19 Jul 2022 22:15:36 -0700
Subject: [PATCH 1/1] Avoid copying WAL segments before divergence to speed up
pg_rewind
"Optimize pg_rewind to Skip Copying Common WAL Files". It adds a conditional
check to avoid uncessesarily copying any WAL segment files from source to
target if they are common between both servers, before the point of WAL
divergence during pg_rewind. On the source server, each WAL file's
corresponding segment number is computed and compared against the segement
number of the first diverged LSN. All WAL files which fall before the segment
of the first diverged LSN can safely be skipped from copying to the target.
The reduction in WAL segment files transmitted over the network from source to
target server massively reduces overall pg_rewind execution time, when a large
amount of WAL segment files are retained.
This patch is intended for immediate application into the Postgres master
branch on version 14.4.
Regression tests are included to verify that WAL segment files prior to WAL
diverge are not copied. The source code successfully compiles, and all tests
successfully pass.
Author: Justin Kwan, Vignesh Ravichandran
---
src/bin/pg_rewind/filemap.c | 15 +++++
src/bin/pg_rewind/pg_rewind.c | 39 ++++++++++-
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/001_basic.pl | 65 +++++++++++++++++++
src/bin/pg_rewind/t/008_min_recovery_point.pl | 64 ++++++++++++++++++
5 files changed, 183 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 269ed6446e..2fd0523ce5 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -26,6 +26,7 @@
#include <sys/stat.h>
#include <unistd.h>
+#include "access/xlog_internal.h"
#include "catalog/pg_tablespace_d.h"
#include "common/hashfn.h"
#include "common/string.h"
@@ -728,6 +729,20 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (!entry->isrelfile)
{
+ /* Handle WAL segment file. */
+ const char *fname;
+ char *slash;
+
+ /* Split filepath into directory & filename. */
+ slash = strrchr(path, '/');
+ if (slash)
+ fname = slash + 1;
+ else
+ fname = path;
+
+ if (IsXLogFileName(fname))
+ return decide_wal_file_action(fname);
+
/*
* It's a non-data file that we have no special processing
* for. Copy it in toto.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..7bdc3447c2 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -25,7 +25,6 @@
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "file_ops.h"
-#include "filemap.h"
#include "getopt_long.h"
#include "pg_rewind.h"
#include "rewind_source.h"
@@ -69,6 +68,8 @@ bool dry_run = false;
bool do_sync = true;
bool restore_wal = false;
+static XLogRecPtr divergerec;
+
/* Target history */
TimeLineHistoryEntry *targetHistory;
int targetNentries;
@@ -129,7 +130,6 @@ main(int argc, char **argv)
};
int option_index;
int c;
- XLogRecPtr divergerec;
int lastcommontliIndex;
XLogRecPtr chkptrec;
TimeLineID chkpttli;
@@ -490,6 +490,41 @@ main(int argc, char **argv)
return 0;
}
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ XLogSegNo last_common_segno;
+
+ /*
+ * Find last common WAL segment number between source and target before
+ * divergence given last common LSN (byte position).
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
+ /* Get current WAL segment number given current segment file name. */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid unnecessarily copying WAL segment files created before last common
+ * segment to avoid performance penalty when many WAL segment files are
+ * retained on source and copied to target.
+ *
+ * These files are already common between new source (old target) and new
+ * target (old source). Only WAL segment files after the last common segment
+ * number on the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Perform the rewind.
*
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 8b4b50a33b..0752348831 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "datapagemap.h"
+#include "filemap.h"
#include "libpq-fe.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
@@ -47,6 +48,7 @@ extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
/* in pg_rewind.c */
extern void progress_report(bool finished);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in timeline.c */
extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index db9201f38e..7caff1c029 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -172,6 +172,71 @@ in primary, before promotion
),
'drop');
+ if ($test_mode eq 'archive')
+ {
+ goto SKIP;
+ }
+
+ # Read logs produced by pg_rewind
+ my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_001_basic";
+ my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+ };
+
+ # This WAL segment file is common between both source and target,
+ # originating before WAL timeline divergence. pg_rewind should not copy
+ # and overwrite this target file with the source's copy.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+ if ($test_mode eq 'remote')
+ {
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+ }
+
+ # These WAL segment files occur just before and after the divergence to the
+ # new timeline. (Both of these WAL files are internally represented by
+ # segment 3.) pg_rewind should copy and overwrite these target files with
+ # the source's copies.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file included in copying to target');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
# Permissions on PGDATA should be default
SKIP:
{
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e6a7177fb7..204ec10738 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -174,4 +174,68 @@ and this too), 'table foo after rewind');
$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
is($result, qq(in both), 'table bar after rewind');
+# Read logs produced by pg_rewind
+my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_008_min_recovery_point";
+my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+};
+
+# This WAL segment file is common between both source and target,
+# originating before WAL timeline divergence. pg_rewind should not copy
+# and overwrite this target file with the source's copy.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+
+# These WAL segment files occur just before and after the divergence to the
+# new timeline. (Both of these WAL files are internally represented by
+# segment 3.) pg_rewind should copy and overwrite these target files with
+# the source's copies.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000020000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000030000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000030000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
done_testing();
--
2.35.3
Hi, Justin!
On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thank you for the revision.
I've taken a look at this patch. Overall it looks good to me. I also
don't see any design objections in the thread.
A couple of points from me:
1) I would prefer to evade hard-coded names for WAL segments in the
tap tests. Could we calculate the names in the tap tests based on the
diverge point, etc.?
2) Patch contains some indentation with spaces, which should be done
in tabs. Please consider either manually fixing this or running
pgindent over modified files.
------
Regards,
Alexander Korotkov
Hi,
On 2022-09-13 20:50:20 +0300, Alexander Korotkov wrote:
On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
Thank you for the revision.
I've taken a look at this patch. Overall it looks good to me. I also
don't see any design objections in the thread.A couple of points from me:
1) I would prefer to evade hard-coded names for WAL segments in the
tap tests. Could we calculate the names in the tap tests based on the
diverge point, etc.?
2) Patch contains some indentation with spaces, which should be done
in tabs. Please consider either manually fixing this or running
pgindent over modified files.
This patch currently fails because it hasn't been adjusted for
commit c47885bd8b6
Author: Andres Freund <andres@anarazel.de>
Date: 2022-09-19 18:03:17 -0700
Split TESTDIR into TESTLOGDIR and TESTDATADIR
The adjustment is trivial. Attached, together with also producing an error
message rather than just dying wordlessly.
It doesn't seem quite right to read pg_rewind's logs by reading
regress_log_001_basic. Too easy to confuse different runs of pg_rewind
etc. I'd suggest trying to redirect the log to a different file.
With regard to Alexander's point about whitespace:
.git/rebase-apply/patch:25: indent with spaces.
/* Handle WAL segment file. */
.git/rebase-apply/patch:26: indent with spaces.
const char *fname;
.git/rebase-apply/patch:27: indent with spaces.
char *slash;
.git/rebase-apply/patch:29: indent with spaces.
/* Split filepath into directory & filename. */
.git/rebase-apply/patch:30: indent with spaces.
slash = strrchr(path, '/');
warning: squelched 29 whitespace errors
warning: 34 lines add whitespace errors.
Greetings,
Andres Freund
Attachments:
v3-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patchtext/x-diff; charset=us-asciiDownload
From 44b987f4f21d4ff6c898b085db3c6c1094766f97 Mon Sep 17 00:00:00 2001
From: Justin Kwan <jkwan@cloudflare.com>
Date: Tue, 19 Jul 2022 22:15:36 -0700
Subject: [PATCH v3 1/2] Avoid copying WAL segments before divergence to speed
up pg_rewind
"Optimize pg_rewind to Skip Copying Common WAL Files". It adds a conditional
check to avoid uncessesarily copying any WAL segment files from source to
target if they are common between both servers, before the point of WAL
divergence during pg_rewind. On the source server, each WAL file's
corresponding segment number is computed and compared against the segement
number of the first diverged LSN. All WAL files which fall before the segment
of the first diverged LSN can safely be skipped from copying to the target.
The reduction in WAL segment files transmitted over the network from source to
target server massively reduces overall pg_rewind execution time, when a large
amount of WAL segment files are retained.
This patch is intended for immediate application into the Postgres master
branch on version 14.4.
Regression tests are included to verify that WAL segment files prior to WAL
diverge are not copied. The source code successfully compiles, and all tests
successfully pass.
Author: Justin Kwan, Vignesh Ravichandran
---
src/bin/pg_rewind/filemap.c | 15 +++++
src/bin/pg_rewind/pg_rewind.c | 39 ++++++++++-
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/001_basic.pl | 65 +++++++++++++++++++
src/bin/pg_rewind/t/008_min_recovery_point.pl | 64 ++++++++++++++++++
5 files changed, 183 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 269ed6446e6..2fd0523ce5b 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -26,6 +26,7 @@
#include <sys/stat.h>
#include <unistd.h>
+#include "access/xlog_internal.h"
#include "catalog/pg_tablespace_d.h"
#include "common/hashfn.h"
#include "common/string.h"
@@ -728,6 +729,20 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (!entry->isrelfile)
{
+ /* Handle WAL segment file. */
+ const char *fname;
+ char *slash;
+
+ /* Split filepath into directory & filename. */
+ slash = strrchr(path, '/');
+ if (slash)
+ fname = slash + 1;
+ else
+ fname = path;
+
+ if (IsXLogFileName(fname))
+ return decide_wal_file_action(fname);
+
/*
* It's a non-data file that we have no special processing
* for. Copy it in toto.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3cd77c09b1a..4ac38427b38 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -25,7 +25,6 @@
#include "fe_utils/recovery_gen.h"
#include "fe_utils/string_utils.h"
#include "file_ops.h"
-#include "filemap.h"
#include "getopt_long.h"
#include "pg_rewind.h"
#include "rewind_source.h"
@@ -69,6 +68,8 @@ bool dry_run = false;
bool do_sync = true;
bool restore_wal = false;
+static XLogRecPtr divergerec;
+
/* Target history */
TimeLineHistoryEntry *targetHistory;
int targetNentries;
@@ -129,7 +130,6 @@ main(int argc, char **argv)
};
int option_index;
int c;
- XLogRecPtr divergerec;
int lastcommontliIndex;
XLogRecPtr chkptrec;
TimeLineID chkpttli;
@@ -490,6 +490,41 @@ main(int argc, char **argv)
return 0;
}
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ XLogSegNo last_common_segno;
+
+ /*
+ * Find last common WAL segment number between source and target before
+ * divergence given last common LSN (byte position).
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
+ /* Get current WAL segment number given current segment file name. */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid unnecessarily copying WAL segment files created before last common
+ * segment to avoid performance penalty when many WAL segment files are
+ * retained on source and copied to target.
+ *
+ * These files are already common between new source (old target) and new
+ * target (old source). Only WAL segment files after the last common segment
+ * number on the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Perform the rewind.
*
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 80c276285a9..64b84ff1217 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "datapagemap.h"
+#include "filemap.h"
#include "libpq-fe.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
@@ -47,6 +48,7 @@ extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
/* in pg_rewind.c */
extern void progress_report(bool finished);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in timeline.c */
extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index db9201f38e8..7caff1c0296 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -172,6 +172,71 @@ in primary, before promotion
),
'drop');
+ if ($test_mode eq 'archive')
+ {
+ goto SKIP;
+ }
+
+ # Read logs produced by pg_rewind
+ my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_001_basic";
+ my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+ };
+
+ # This WAL segment file is common between both source and target,
+ # originating before WAL timeline divergence. pg_rewind should not copy
+ # and overwrite this target file with the source's copy.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+ if ($test_mode eq 'remote')
+ {
+ unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+ }
+
+ # These WAL segment files occur just before and after the divergence to the
+ # new timeline. (Both of these WAL files are internally represented by
+ # segment 3.) pg_rewind should copy and overwrite these target files with
+ # the source's copies.
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file included in copying to target');
+ if ($test_mode eq 'remote')
+ {
+ like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+ }
+
# Permissions on PGDATA should be default
SKIP:
{
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index e6a7177fb7d..204ec107382 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -174,4 +174,68 @@ and this too), 'table foo after rewind');
$result = $node_2->safe_psql('postgres', 'SELECT * FROM public.bar');
is($result, qq(in both), 'table bar after rewind');
+# Read logs produced by pg_rewind
+my $log_file_name =
+ "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_008_min_recovery_point";
+my $pg_rewind_logs = do {
+ local $/ = undef;
+ open(my $log_file, '<', $log_file_name) || die;
+ <$log_file>;
+};
+
+# This WAL segment file is common between both source and target,
+# originating before WAL timeline divergence. pg_rewind should not copy
+# and overwrite this target file with the source's copy.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000002" not copied to target/,
+ 'common segment file skipped from copying');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000002 \(COPY\)/,
+ 'common segment file not copied');
+unlike(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000002"/,
+ 'common segment file copy not received by remote target');
+
+# These WAL segment files occur just before and after the divergence to the
+# new timeline. (Both of these WAL files are internally represented by
+# segment 3.) pg_rewind should copy and overwrite these target files with
+# the source's copies.
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000010000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000010000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000010000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: WAL file entry "000000020000000000000003" is copied to target/,
+ 'diverged segment file included in copying to target');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000020000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000020000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: pg_wal\/000000030000000000000003 \(COPY\)/,
+ 'diverged segment file copied');
+like(
+ $pg_rewind_logs,
+ qr/pg_rewind: received chunk for file "pg_wal\/000000030000000000000003"/,
+ 'diverged segment file copy received by remote target');
+
done_testing();
--
2.37.3.542.gdd3f6c4cae
v3-0002-Fix-log-file-names-used-in-tests.patchtext/x-diff; charset=us-asciiDownload
From e29ff080ba2f712522e42ebb8cbdafc934e40f04 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 2 Oct 2022 10:36:55 -0700
Subject: [PATCH v3 2/2] Fix log file names used in tests
---
src/bin/pg_rewind/t/001_basic.pl | 4 ++--
src/bin/pg_rewind/t/008_min_recovery_point.pl | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index 7caff1c0296..cad6cf5b43a 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -179,10 +179,10 @@ in primary, before promotion
# Read logs produced by pg_rewind
my $log_file_name =
- "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_001_basic";
+ "$PostgreSQL::Test::Utils::log_path/regress_log_001_basic";
my $pg_rewind_logs = do {
local $/ = undef;
- open(my $log_file, '<', $log_file_name) || die;
+ open(my $log_file, '<', $log_file_name) || die "could not open $log_file_name: $!";
<$log_file>;
};
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 204ec107382..deb7d00e3fa 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -176,10 +176,10 @@ is($result, qq(in both), 'table bar after rewind');
# Read logs produced by pg_rewind
my $log_file_name =
- "$PostgreSQL::Test::Utils::tmp_check/log/regress_log_008_min_recovery_point";
+ "$PostgreSQL::Test::Utils::log_path/regress_log_008_min_recovery_point";
my $pg_rewind_logs = do {
local $/ = undef;
- open(my $log_file, '<', $log_file_name) || die;
+ open(my $log_file, '<', $log_file_name) || die "could not open $log_file_name: $!";
<$log_file>;
};
--
2.37.3.542.gdd3f6c4cae
On Sun, Oct 02, 2022 at 10:44:25AM -0700, Andres Freund wrote:
It doesn't seem quite right to read pg_rewind's logs by reading
regress_log_001_basic. Too easy to confuse different runs of pg_rewind
etc. I'd suggest trying to redirect the log to a different file.
Hardcoding log file names in the test increases the overall
maintenance, even if renaming these would be easy to track and fix if
the naming convention is changed. Anyway, I think that what this
patch should do is to use command_checks_all() in RewindTest.pm as it
is the test API able to check after a status and multiple expected
outputs, which is what the changes in 001 and 008 are doing.
RewindTest::run_pg_rewind() needs to be a bit tweaked to accept these
regex patterns in input.
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?
decide_wal_file_action() assumes that the WAL segment exists on the
target and the source. This looks bug-prone to me without at least an
assertion.
file_entry_t has an entry to track if a file is a relation file. I
think that it would be much cleaner to track if we are handling a WAL
segment when inserting an entry in insert_filehash_entry(), so
isrelfile could be replaced by an enum with three values: relation
file, WAL segment and the rest.
--
Michael
On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
file_entry_t has an entry to track if a file is a relation file. I
think that it would be much cleaner to track if we are handling a WAL
segment when inserting an entry in insert_filehash_entry(), so
isrelfile could be replaced by an enum with three values: relation
file, WAL segment and the rest.
This review has been done a few weeks ago, and there has been no
update since, so I am marking this entry as returned with feedback in
the CF app.
--
Michael
Hi,
I've attached an updated version of the patch against master with the changes
suggested.
On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?
Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
implies pg_wal/ should look the same. There might be use cases around logical
replication where you would want these WAL files to still exist even
across promotions?
decide_wal_file_action() assumes that the WAL segment exists on the
target and the source. This looks bug-prone to me without at least an
assertion.
From previous refactors there is now an Assertion in filemap.c
decide_file_action that handles this.
Assert(entry->target_exists && entry->source_exists);
decide_wal_file_action is called after the assertion.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
0004-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/x-patch; name=0004-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 1825113812381f52c52e74a3e0e3e4c519f218bc Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Tue, 1 Jul 2025 18:05:37 +0000
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a conditional check to avoid unnecessarily copying any
WAL segment files from source to target if they are common
between both servers before the point of WAL divergence
during pg_rewind. On the source server, each WAL file's.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target.
---
src/bin/pg_rewind/filemap.c | 40 ++++++++++++++++--
src/bin/pg_rewind/filemap.h | 9 +++-
src/bin/pg_rewind/pg_rewind.c | 41 +++++++++++++++++++
src/bin/pg_rewind/pg_rewind.h | 2 +
src/bin/pg_rewind/t/008_min_recovery_point.pl | 18 +++++++-
5 files changed, 104 insertions(+), 6 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..79e5bbd7710 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -58,6 +58,7 @@ static filehash_hash *filehash;
static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileLocator rlocator, ForkNumber forknum,
BlockNumber segno);
+static file_content_type_t getFileType(const char *path);
static file_entry_t *insert_filehash_entry(const char *path);
static file_entry_t *lookup_filehash_entry(const char *path);
@@ -199,6 +200,28 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileType(const char *path)
+{
+ if (isRelDataFile(path))
+ return FILE_CONTENT_TYPE_RELATION;
+ else
+ {
+ /* Extract the filename from the path */
+ const char *filename = last_dir_separator(path);
+ if (filename == NULL)
+ filename = path;
+ else
+ filename++; /* Skip the separator */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ }
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +233,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -383,7 +406,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -799,7 +822,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..fada420fc23 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -36,6 +36,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +58,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 9d16c1e6b47..917bc298cad 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -56,6 +56,9 @@ static void findCommonAncestorTimeline(TimeLineHistoryEntry *a_history,
static void ensureCleanShutdown(const char *argv0);
static void disconnect_atexit(void);
+/* WAL segment files handling */
+file_action_t decide_wal_file_action(const char *fname);
+
static ControlFileData ControlFile_target;
static ControlFileData ControlFile_source;
static ControlFileData ControlFile_source_after;
@@ -88,6 +91,9 @@ uint64 fetch_done;
static PGconn *conn;
static rewind_source *source;
+/* Segment number of the divergence record */
+XLogSegNo last_common_segno;
+
static void
usage(const char *progname)
{
@@ -397,6 +403,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -1204,3 +1213,35 @@ disconnect_atexit(void)
if (conn != NULL)
PQfinish(conn);
}
+
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, ControlFile_target.xlog_seg_size);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target.
+ * Only WAL segment files after the last common segment number on
+ * the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 9cea144d2b2..421b4201998 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -14,6 +14,7 @@
#include "access/timeline.h"
#include "common/logging.h"
#include "common/file_utils.h"
+#include "filemap.h"
/* Configuration options */
extern char *datadir_target;
@@ -45,6 +46,7 @@ extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
/* in pg_rewind.c */
extern void progress_report(bool finished);
+extern file_action_t decide_wal_file_action(const char *fname);
/* in timeline.c */
extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 28496afe350..ea279a9ee4b 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -71,6 +71,9 @@ $node_3->start;
# Wait until node 3 has connected and caught up
$node_1->wait_for_catchup('node_3');
+# Current common WAL segment
+my $wal_seg_skipped = $node_1->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
#
# Swap the roles of node_1 and node_3, so that node_1 follows node_3.
#
@@ -93,6 +96,10 @@ primary_conninfo='$node_3_connstr'
));
$node_2->restart();
+# Confirm WAL segment has advanced for promotion
+my $advanced_wal_segment = $node_3->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+isnt($wal_seg_skipped, $advanced_wal_segment, "Expected WAL segment to have advanced");
+
#
# Promote node_1, to create a split-brain scenario.
#
@@ -140,14 +147,21 @@ copy(
"$node_2_pgdata/postgresql.conf",
"$tmp_folder/node_2-postgresql.conf.tmp");
-command_ok(
+command_checks_all(
[
'pg_rewind',
'--source-server' => $node_1_connstr,
'--target-pgdata' => $node_2_pgdata,
'--debug',
],
- 'run pg_rewind');
+ 0,
+ [
+ # qr/"WAL file entry $wal_seg_skipped not copied to target"/
+ qr//
+ ],
+ [qr//],
+ 'run pg_rewind'
+);
# Now move back postgresql.conf with old settings
move(
--
2.47.1
On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
Hi,
I've attached an updated version of the patch against master with the changes
suggested.On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
implies pg_wal/ should look the same. There might be use cases around logical
replication where you would want these WAL files to still exist even
across promotions?decide_wal_file_action() assumes that the WAL segment exists on the
target and the source. This looks bug-prone to me without at least an
assertion.From previous refactors there is now an Assertion in filemap.c
decide_file_action that handles this.Assert(entry->target_exists && entry->source_exists);
decide_wal_file_action is called after the assertion.
Hi, John
Thanks for updating the patch.
1.
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileType(const char *path)
Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?
2.
Perhaps decide_wal_file_action() could be defined in filemap.c.
While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?
--
Regards,
Japin Li
HI
2.
Perhaps decide_wal_file_action() could be defined in filemap.c.
While this is unrelated to WAL logging, it could also contribute to
faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?
Agree ,Usually the log file directory also takes up a lot of space, and the
number of log files is quite large
On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
Show quoted text
On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
Hi,
I've attached an updated version of the patch against master with the
changes
suggested.
On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?Hard to say. pg_rewind is intended to make the same "copy" of the
cluster which
implies pg_wal/ should look the same. There might be use cases around
logical
replication where you would want these WAL files to still exist even
across promotions?decide_wal_file_action() assumes that the WAL segment exists on the
target and the source. This looks bug-prone to me without at least an
assertion.From previous refactors there is now an Assertion in filemap.c
decide_file_action that handles this.Assert(entry->target_exists && entry->source_exists);
decide_wal_file_action is called after the assertion.
Hi, John
Thanks for updating the patch.
1. +/* Determine the type of file content (relation, WAL, or other) */ +static file_content_type_t +getFileType(const char *path)Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?2.
Perhaps decide_wal_file_action() could be defined in filemap.c.While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?--
Regards,
Japin Li
Hi,
Thanks for the quick review.
On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Perhaps decide_wal_file_action() could be defined in filemap.c.
That's a good point. I updated the patch to reflect that.
While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large
Should we handle this use case? I do agree that for the more common
use-cases of pg_rewind which is synchronizing an old writer to the new
leader after failover, avoiding syncing the logging directory is
useful.
At the same time, pg_rewind is intended to make the same copy of the
source cluster as efficiently as possible which would include "all"
directories if they exist in $PGDATA. By default logging_collector is
off as well. I'd also think you would want to avoid putting the logs
in $PGDATA to have smaller backups if you are using tools like
pg_basebackup.
On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
Hi, John
Thanks for updating the patch.
1. +/* Determine the type of file content (relation, WAL, or other) */ +static file_content_type_t +getFileType(const char *path)Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?
Do you mean naming getFileType to getFileContentType?
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
0005-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=0005-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 3de0c93d385c7127b87b5b9e3a9a9eb1c5aad237 Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Tue, 1 Jul 2025 18:05:37 +0000
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a conditional check to avoid unnecessarily copying any
WAL segment files from source to target if they are common
between both servers before the point of WAL divergence
during pg_rewind. On the source server, each WAL file's.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target.
---
src/bin/pg_rewind/filemap.c | 77 +++++++++++++++++--
src/bin/pg_rewind/filemap.h | 12 ++-
src/bin/pg_rewind/pg_rewind.c | 6 +-
src/bin/pg_rewind/t/008_min_recovery_point.pl | 18 ++++-
4 files changed, 102 insertions(+), 11 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..74800568be5 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -199,6 +199,28 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileType(const char *path)
+{
+ if (isRelDataFile(path))
+ return FILE_CONTENT_TYPE_RELATION;
+ else
+ {
+ /* Extract the filename from the path */
+ const char *filename = last_dir_separator(path);
+ if (filename == NULL)
+ filename = path;
+ else
+ filename++; /* Skip the separator */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ }
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +232,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -383,7 +405,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -693,11 +715,43 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno, int wal_segsz_bytes)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, wal_segsz_bytes);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target.
+ * Only WAL segment files after the last common segment number on
+ * the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno, int wal_segsz_bytes)
{
const char *path = entry->path;
@@ -799,7 +853,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno, wal_segsz_bytes);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
@@ -858,7 +923,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno, int wal_segsz_bytes)
{
int i;
filehash_iterator it;
@@ -868,7 +933,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno, wal_segsz_bytes);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..ef1f539f1d1 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
@@ -106,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno, int wal_segsz_bytes);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 9d16c1e6b47..aaee5ceba84 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -153,6 +153,7 @@ main(int argc, char **argv)
bool rewind_needed;
bool writerecoveryconf = false;
filemap_t *filemap;
+ XLogSegNo last_common_segno;
pg_logging_init(argv[0]);
set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
@@ -397,6 +398,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +496,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno, ControlFile_target.xlog_seg_size);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 28496afe350..ea279a9ee4b 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -71,6 +71,9 @@ $node_3->start;
# Wait until node 3 has connected and caught up
$node_1->wait_for_catchup('node_3');
+# Current common WAL segment
+my $wal_seg_skipped = $node_1->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
#
# Swap the roles of node_1 and node_3, so that node_1 follows node_3.
#
@@ -93,6 +96,10 @@ primary_conninfo='$node_3_connstr'
));
$node_2->restart();
+# Confirm WAL segment has advanced for promotion
+my $advanced_wal_segment = $node_3->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+isnt($wal_seg_skipped, $advanced_wal_segment, "Expected WAL segment to have advanced");
+
#
# Promote node_1, to create a split-brain scenario.
#
@@ -140,14 +147,21 @@ copy(
"$node_2_pgdata/postgresql.conf",
"$tmp_folder/node_2-postgresql.conf.tmp");
-command_ok(
+command_checks_all(
[
'pg_rewind',
'--source-server' => $node_1_connstr,
'--target-pgdata' => $node_2_pgdata,
'--debug',
],
- 'run pg_rewind');
+ 0,
+ [
+ # qr/"WAL file entry $wal_seg_skipped not copied to target"/
+ qr//
+ ],
+ [qr//],
+ 'run pg_rewind'
+);
# Now move back postgresql.conf with old settings
move(
--
2.47.1
On Wed, 02 Jul 2025 at 11:21, John H <johnhyvr@gmail.com> wrote:
Hi,
Thanks for the quick review.
On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Perhaps decide_wal_file_action() could be defined in filemap.c.
That's a good point. I updated the patch to reflect that.
Thanks for updating the patch.
While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations. Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large
Should we handle this use case? I do agree that for the more common
use-cases of pg_rewind which is synchronizing an old writer to the new
leader after failover, avoiding syncing the logging directory is
useful.
At the same time, pg_rewind is intended to make the same copy of the
source cluster as efficiently as possible which would include "all"
directories if they exist in $PGDATA. By default logging_collector is
off as well. I'd also think you would want to avoid putting the logs
in $PGDATA to have smaller backups if you are using tools like
pg_basebackup.
Splitting the logs from $PGDATA is definitely better. The question is whether
it's worth implementing this directly in core or if a prominent note in the
documentation would suffice.
On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
Hi, John
Thanks for updating the patch.
1. +/* Determine the type of file content (relation, WAL, or other) */ +static file_content_type_t +getFileType(const char *path)Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?Do you mean naming getFileType to getFileContentType?
Exactly! It's confusing that getFileType() returns file_content_type_t
instead of file_type_t.
For v5 patch:
1.
We could simply use the global WalSegSz variable within decide_file_action(),
eliminating the need to pass wal_segsz_bytes as an argument.
2.
For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
signature change for decide_file_actions() and decide_file_action(). I'm not
insisting on this approach, however.
--
Regards,
Japin Li
Hi,
On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
Splitting the logs from $PGDATA is definitely better. The question is whether
it's worth implementing this directly in core or if a prominent note in the
documentation would suffice.
I can work on the documentation update as a separate patch if folks
think this is worthwhile.
On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
Exactly! It's confusing that getFileType() returns file_content_type_t
instead of file_type_t.
Ah yes that is confusing, updated in patch.
For v5 patch:
1.
We could simply use the global WalSegSz variable within decide_file_action(),
eliminating the need to pass wal_segsz_bytes as an argument.
Good point.
2.
For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
signature change for decide_file_actions() and decide_file_action(). I'm not
insisting on this approach, however.
I made it a global as well, and had to include access/xlog_internal.h
in pg_rewind.h but I don't feel strongly about it either way.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
0006-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=0006-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 1fd59466ad598b00d613fd3fec85466d46d3d122 Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Thu, 3 Jul 2025 19:39:37 +0000
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a conditional check to avoid unnecessarily copying any
WAL segment files from source to target if they are common
between both servers before the point of WAL divergence
during pg_rewind. On the source server, each WAL file's.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target.
---
src/bin/pg_rewind/filemap.c | 71 ++++++++++++++++++-
src/bin/pg_rewind/filemap.h | 10 ++-
src/bin/pg_rewind/pg_rewind.c | 5 +-
src/bin/pg_rewind/pg_rewind.h | 4 ++
src/bin/pg_rewind/t/008_min_recovery_point.pl | 18 ++++-
5 files changed, 101 insertions(+), 7 deletions(-)
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..14c7643f0e0 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -199,6 +199,28 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ if (isRelDataFile(path))
+ return FILE_CONTENT_TYPE_RELATION;
+ else
+ {
+ /* Extract the filename from the path */
+ const char *filename = last_dir_separator(path);
+ if (filename == NULL)
+ filename = path;
+ else
+ filename++; /* Skip the separator */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ }
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +232,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -383,7 +405,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -693,6 +715,38 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target.
+ * Only WAL segment files after the last common segment number on
+ * the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
@@ -799,7 +853,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..53082ceba3c 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 9d16c1e6b47..83a64270418 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -15,7 +15,6 @@
#include <unistd.h>
#include "access/timeline.h"
-#include "access/xlog_internal.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "common/controldata_utils.h"
@@ -62,6 +61,7 @@ static ControlFileData ControlFile_source_after;
static const char *progname;
int WalSegSz;
+XLogSegNo last_common_segno;
/* Configuration options */
char *datadir_target = NULL;
@@ -397,6 +397,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 9cea144d2b2..759be379e12 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -12,6 +12,7 @@
#define PG_REWIND_H
#include "access/timeline.h"
+#include "access/xlog_internal.h"
#include "common/logging.h"
#include "common/file_utils.h"
@@ -51,4 +52,7 @@ extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
TimeLineID targetTLI,
int *nentries);
+/* optimize WAL files copied */
+extern XLogSegNo last_common_segno;
+
#endif /* PG_REWIND_H */
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 28496afe350..ea279a9ee4b 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -71,6 +71,9 @@ $node_3->start;
# Wait until node 3 has connected and caught up
$node_1->wait_for_catchup('node_3');
+# Current common WAL segment
+my $wal_seg_skipped = $node_1->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
#
# Swap the roles of node_1 and node_3, so that node_1 follows node_3.
#
@@ -93,6 +96,10 @@ primary_conninfo='$node_3_connstr'
));
$node_2->restart();
+# Confirm WAL segment has advanced for promotion
+my $advanced_wal_segment = $node_3->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+isnt($wal_seg_skipped, $advanced_wal_segment, "Expected WAL segment to have advanced");
+
#
# Promote node_1, to create a split-brain scenario.
#
@@ -140,14 +147,21 @@ copy(
"$node_2_pgdata/postgresql.conf",
"$tmp_folder/node_2-postgresql.conf.tmp");
-command_ok(
+command_checks_all(
[
'pg_rewind',
'--source-server' => $node_1_connstr,
'--target-pgdata' => $node_2_pgdata,
'--debug',
],
- 'run pg_rewind');
+ 0,
+ [
+ # qr/"WAL file entry $wal_seg_skipped not copied to target"/
+ qr//
+ ],
+ [qr//],
+ 'run pg_rewind'
+);
# Now move back postgresql.conf with old settings
move(
--
2.47.1
On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
Hi,
On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
Splitting the logs from $PGDATA is definitely better. The question is whether
it's worth implementing this directly in core or if a prominent note in the
documentation would suffice.I can work on the documentation update as a separate patch if folks
think this is worthwhile.On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
Exactly! It's confusing that getFileType() returns file_content_type_t
instead of file_type_t.Ah yes that is confusing, updated in patch.
For v5 patch:
1.
We could simply use the global WalSegSz variable within decide_file_action(),
eliminating the need to pass wal_segsz_bytes as an argument.Good point.
2.
For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
signature change for decide_file_actions() and decide_file_action(). I'm not
insisting on this approach, however.I made it a global as well, and had to include access/xlog_internal.h
in pg_rewind.h but I don't feel strongly about it either way.
Thanks, LGTM.
--
Regards,
Japin Li
Hi
Thanks, LGTM.
I think it's possible to register to https://commitfest.postgresql.org/54,
https://commitfest.postgresql.org/53 will closed soon
Thanks
On Fri, Jul 4, 2025 at 10:50 AM Japin Li <japinli@hotmail.com> wrote:
Show quoted text
On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
Hi,
On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
Splitting the logs from $PGDATA is definitely better. The question is
whether
it's worth implementing this directly in core or if a prominent note in
the
documentation would suffice.
I can work on the documentation update as a separate patch if folks
think this is worthwhile.On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com>
wrote:
Exactly! It's confusing that getFileType() returns file_content_type_t
instead of file_type_t.Ah yes that is confusing, updated in patch.
For v5 patch:
1.
We could simply use the global WalSegSz variable withindecide_file_action(),
eliminating the need to pass wal_segsz_bytes as an argument.
Good point.
2.
For last_common_segno, we could implement it similarly to WalSegSz,avoiding a
signature change for decide_file_actions() and decide_file_action().
I'm not
insisting on this approach, however.
I made it a global as well, and had to include access/xlog_internal.h
in pg_rewind.h but I don't feel strongly about it either way.Thanks, LGTM.
--
Regards,
Japin Li
Hi,
Thanks for the review.
On Sun, Jul 6, 2025 at 7:12 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi
Thanks, LGTM.
I think it's possible to register to https://commitfest.postgresql.org/54, https://commitfest.postgresql.org/53 will closed soon
I've created the commitfest entry here:
https://commitfest.postgresql.org/patch/5902/
I realized I wasn't testing properly in 0006. I've updated the latest
one to move it into a separate test file to make the results cleaner.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
0007-Avoid-copying-WAL-segments-before-divergence-to-spee.patchapplication/octet-stream; name=0007-Avoid-copying-WAL-segments-before-divergence-to-spee.patchDownload
From 1c2d9a0fc6ff551c1eec0c13f44ece813c0ee7b7 Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Thu, 3 Jul 2025 19:39:37 +0000
Subject: [PATCH] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a conditional check to avoid unnecessarily copying any
WAL segment files from source to target if they are common
between both servers before the point of WAL divergence
during pg_rewind. On the source server, each WAL file's.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target.
---
src/bin/pg_rewind/filemap.c | 71 ++++++++++++++++++-
src/bin/pg_rewind/filemap.h | 10 ++-
src/bin/pg_rewind/pg_rewind.c | 5 +-
src/bin/pg_rewind/pg_rewind.h | 4 ++
.../t/011_avoid_copying_common_wals.pl | 52 ++++++++++++++
5 files changed, 137 insertions(+), 5 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..14c7643f0e0 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -199,6 +199,28 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ if (isRelDataFile(path))
+ return FILE_CONTENT_TYPE_RELATION;
+ else
+ {
+ /* Extract the filename from the path */
+ const char *filename = last_dir_separator(path);
+ if (filename == NULL)
+ filename = path;
+ else
+ filename++; /* Skip the separator */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ }
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +232,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -383,7 +405,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -693,6 +715,38 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target.
+ * Only WAL segment files after the last common segment number on
+ * the new source need to be copied to the new target.
+ */
+ if (file_segno < last_common_segno)
+ {
+ pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ pg_log_debug("WAL file entry \"%s\" is copied to target", fname);
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
@@ -799,7 +853,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..53082ceba3c 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..002e8f667e0 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -15,7 +15,6 @@
#include <unistd.h>
#include "access/timeline.h"
-#include "access/xlog_internal.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "common/controldata_utils.h"
@@ -62,6 +61,7 @@ static ControlFileData ControlFile_source_after;
static const char *progname;
int WalSegSz;
+XLogSegNo last_common_segno;
/* Configuration options */
char *datadir_target = NULL;
@@ -397,6 +397,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 9cea144d2b2..759be379e12 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -12,6 +12,7 @@
#define PG_REWIND_H
#include "access/timeline.h"
+#include "access/xlog_internal.h"
#include "common/logging.h"
#include "common/file_utils.h"
@@ -51,4 +52,7 @@ extern TimeLineHistoryEntry *rewind_parseTimeLineHistory(char *buffer,
TimeLineID targetTLI,
int *nentries);
+/* optimize WAL files copied */
+extern XLogSegNo last_common_segno;
+
#endif /* PG_REWIND_H */
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
new file mode 100644
index 00000000000..3c823fc86e2
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -0,0 +1,52 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Test situation where source and target data directory contains
+# the same WAL files
+#
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+
+# advance WAL on primary; this WAL segment will be common between both
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Common segment to be skipped copying over
+my $wal_seg_skipped = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# last common checkpoint
+RewindTest::primary_psql("CHECKPOINT");
+
+RewindTest::promote_standby;
+
+$node_standby->stop();
+$node_primary->stop();
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//
+ ],
+ [qr/WAL file entry \"$wal_seg_skipped\" not copied to target/],
+ 'run pg_rewind'
+);
+
+done_testing();
--
2.47.1
On Mon, Jul 7, 2025 at 2:46 PM John H <johnhyvr@gmail.com> wrote:
I've created the commitfest entry here:
https://commitfest.postgresql.org/patch/5902/I realized I wasn't testing properly in 0006. I've updated the latest
one to move it into a separate test file to make the results cleaner.
Thanks to everyone who has worked on this patch. I think this is a
real problem about which we ought to try to do something. I think you
could even defend calling this a back-patchable bug fix, since copying
a lot of files that we know have to be identical is a serious
efficiency issue that can affect the usability of pg_upgrade. There
are more aggressive things that we could try to do here that might be
even better, e.g. we could copy just the WAL segment or segments that
contain the last checkpoint record and let the server fetch the rest
as needed after it starts up. However, that would be a definitional
change, which we certainly wouldn't want to back-patch, and which
might have downsides for some people, so perhaps it would need to be
optional behavior. But the patch as proposed is just an optimization
that shouldn't change the outcome for anyone. Perhaps it's still best
to treat it as a master-only improvement, but calling it a fix for a
performance bug isn't completely crazy either, IMHO.
Moving on to the patch contents, the approach taken makes sense to me.
Any WAL files that exist in both source and target from prior to the
point of divergence should not need to be copied, because they should
necessarily be identical. If anyone sees a flaw in this reasoning,
please point it out! While I agree with the basic approach, I think
that the patch could be hardened in two ways. First, isRelDataFile
tests the complete pathname to decide whether it has a relation file,
but getFileContentType only tests the final component of the pathname.
I think it should test the whole pathname also, in case there's a
stray file with a file name that looks like a WAL file in some other
directory. In fact, I think that these two functions should be
integrated. Let's rename the isRelDataFile() to getFileContentType()
and put all the logic in that function, including appropriate tests of
the path name. Second, in decide_file_action(), I think we should
check that entry->target_size == entry->source_size and return
FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
very cheap to check and might offer a bit of protection in some weird
scenario.
I am not a huge fan of the way that the patch stuffs last_common_segno
into a global variable that is then accessed by
decide_wal_file_action. I think it would be better to find a way to
pass this information down through the call stack, either as a
separate argument or as part of some struct we're already passing, if
there's a reasonable and not-too-invasive way to do that. If it looks
too ugly, maybe this is the best option, but I think we should try to
find something better.
It does not appear to me that the test case tests what it purports to
test, and I don't think that what it purports to test is the right
thing anyway. It claims that it is testing that a certain WAL file
(which it erroneously calls a "WAL file entry") is not copied to the
target, but I see no logic to actually check this. I also don't think
it's the correct behavior. I think what the patch is doing and should
do is ensure that we don't copy the file when it's already present. If
the file isn't present in the target cluster, it should still be
copied. Granted, I haven't verified that this is the actual behavior,
but the "Handle cases where the file is missing from one of the
systems" code in decide_file_actiomn() seems to be higher up than the
code the patch changes, so I assume that logic should be applied
first. If that's correct, I don't quite know what a test case here can
usefully verify, since the only difference would be performance, but
maybe I'm misunderstanding something?
As an administrative note, the naming convention for the patches
posted to the thread is not consistent and not correct. Use "git
format-patch -v$VERSION" to generate patches. In a name like
v7-0003-do-something.patch, "v7" is the version of the patch set as a
whole, and "0003" identifies a specific patch within that version of
the patch set. In other words, that would be the third patch in the
series from the seventh time that the patch set was posted. This patch
set so far contains only one patch, so the file name should be
vX-0001-whatever.patch, where whatever is the subject of the patch and
X is a number that increases by one every time it's reposted. Let git
format-patch do the work for you, though.
On a related note, something a committer would need to do if they
wanted to commit this patch is figure out who should be credited with
writing it. I see that a number of different people have posted
versions of the patch; it would be nice if "Author:" or
"Co-authored-by:" tags were added to the commit message to reflect
whose code is present in a given version.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
Thanks for taking a look.
On Thu, Sep 25, 2025 at 11:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
...
integrated. Let's rename the isRelDataFile() to getFileContentType()
and put all the logic in that function, including appropriate tests of
the path name. Second, in decide_file_action(), I think we should
check that entry->target_size == entry->source_size and return
FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
very cheap to check and might offer a bit of protection in some weird
scenario.
Updated the patch to reflect that.
I am not a huge fan of the way that the patch stuffs last_common_segno
into a global variable that is then accessed by
...
I forgot why I bothered with a global instead since it was straightforward
to actually pass in the argument. Updated.
It does not appear to me that the test case tests what it purports to
test, and I don't think that what it purports to test is the right
thing anyway. It claims that it is testing that a certain WAL file
(which it erroneously calls a "WAL file entry") is not copied to the
target, but I see no logic to actually check this.
...
That's a fair point. The test does:
1. Write some data -> WAL 1
2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint
Then it checks that "WAL 1 not copied over" was logged which isn't
the best since if it was logged elsewhere then it would still pass.
...
first. If that's correct, I don't quite know what a test case here can
usefully verify, since the only difference would be performance, but
maybe I'm misunderstanding something?
I updated the test to run stat and get the modification time of the common
WAL segment before and after pg_rewind and verify it is the same.
In filemap.c in decide_wal_file_action, if you comment out
return FILE_ACTION_NONE;
the test fails.
As an administrative note, the naming convention for the patches
posted to the thread is not consistent and not correct. Use "git
format-patch -v$VERSION" to generate patches. In a name like
...
Fixed patch formatting.
...
writing it. I see that a number of different people have posted
versions of the patch; it would be nice if "Author:" or
"Co-authored-by:" tags were added to the commit message to reflect
I added Justin as the Co-Author in the commit message but I realized
in their original patch they had "Author: Justin Kwan, Vignesh Ravichandran".
Andres also had the latest patch series before mine so I'll leave it up to
the committer how they want to handle this.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
v8-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patchapplication/octet-stream; name=v8-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patchDownload
From a1cc55a3c6d2acaecf627153b2dd8858d29be1de Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Wed, 8 Oct 2025 21:20:59 +0000
Subject: [PATCH v8] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
src/bin/pg_rewind/filemap.c | 246 +++++++++++-------
src/bin/pg_rewind/filemap.h | 12 +-
src/bin/pg_rewind/pg_rewind.c | 6 +-
.../t/011_avoid_copying_common_wals.pl | 87 +++++++
4 files changed, 254 insertions(+), 97 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..a4162bce24c 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -55,7 +55,6 @@
static filehash_hash *filehash;
-static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileLocator rlocator, ForkNumber forknum,
BlockNumber segno);
@@ -199,6 +198,106 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ RelFileLocator rlocator;
+ unsigned int segNo;
+ int nmatch;
+ bool matched;
+
+ /* Check if it's a WAL file */
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7; /* Skip "pg_wal/" */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
+
+ /*
+ * Does it look like a relation data file?
+ *
+ * For our purposes, only files belonging to the main fork are considered
+ * relation files. Other forks are always copied in toto, because we cannot
+ * reliably track changes to them, because WAL only contains block references
+ * for the main fork.
+ *
+ * ----
+ * Relation data files can be in one of the following directories:
+ *
+ * global/
+ * shared relations
+ *
+ * base/<db oid>/
+ * regular relations, default tablespace
+ *
+ * pg_tblspc/<tblspc oid>/<tblspc version>/
+ * within a non-default tablespace (the name of the directory
+ * depends on version)
+ *
+ * And the relation data files themselves have a filename like:
+ *
+ * <oid>.<segment number>
+ *
+ *----
+ */
+ rlocator.spcOid = InvalidOid;
+ rlocator.dbOid = InvalidOid;
+ rlocator.relNumber = InvalidRelFileNumber;
+ segNo = 0;
+ matched = false;
+
+ nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
+ if (nmatch == 1 || nmatch == 2)
+ {
+ rlocator.spcOid = GLOBALTABLESPACE_OID;
+ rlocator.dbOid = 0;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "base/%u/%u.%u",
+ &rlocator.dbOid, &rlocator.relNumber, &segNo);
+ if (nmatch == 2 || nmatch == 3)
+ {
+ rlocator.spcOid = DEFAULTTABLESPACE_OID;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+ &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
+ &segNo);
+ if (nmatch == 3 || nmatch == 4)
+ matched = true;
+ }
+ }
+
+ /*
+ * The sscanf tests above can match files that have extra characters at
+ * the end. To eliminate such cases, cross-check that GetRelationPath
+ * creates the exact same filename, when passed the RelFileLocator
+ * information we extracted from the filename.
+ */
+ if (matched)
+ {
+ char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
+
+ if (strcmp(check_path, path) != 0)
+ matched = false;
+
+ pfree(check_path);
+ }
+
+ if (matched)
+ return FILE_CONTENT_TYPE_RELATION;
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +309,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -294,7 +393,7 @@ process_source_file(const char *path, file_type_t type, size_t size,
* sanity check: a filename that looks like a data file better be a
* regular file
*/
- if (type != FILE_TYPE_REGULAR && isRelDataFile(path))
+ if (type != FILE_TYPE_REGULAR && getFileContentType(path) == FILE_CONTENT_TYPE_RELATION)
pg_fatal("data file \"%s\" in source is not a regular file", path);
/* Remember this source file */
@@ -383,7 +482,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -559,92 +658,6 @@ print_filemap(filemap_t *filemap)
fflush(stdout);
}
-/*
- * Does it look like a relation data file?
- *
- * For our purposes, only files belonging to the main fork are considered
- * relation files. Other forks are always copied in toto, because we cannot
- * reliably track changes to them, because WAL only contains block references
- * for the main fork.
- */
-static bool
-isRelDataFile(const char *path)
-{
- RelFileLocator rlocator;
- unsigned int segNo;
- int nmatch;
- bool matched;
-
- /*----
- * Relation data files can be in one of the following directories:
- *
- * global/
- * shared relations
- *
- * base/<db oid>/
- * regular relations, default tablespace
- *
- * pg_tblspc/<tblspc oid>/<tblspc version>/
- * within a non-default tablespace (the name of the directory
- * depends on version)
- *
- * And the relation data files themselves have a filename like:
- *
- * <oid>.<segment number>
- *
- *----
- */
- rlocator.spcOid = InvalidOid;
- rlocator.dbOid = InvalidOid;
- rlocator.relNumber = InvalidRelFileNumber;
- segNo = 0;
- matched = false;
-
- nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
- if (nmatch == 1 || nmatch == 2)
- {
- rlocator.spcOid = GLOBALTABLESPACE_OID;
- rlocator.dbOid = 0;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "base/%u/%u.%u",
- &rlocator.dbOid, &rlocator.relNumber, &segNo);
- if (nmatch == 2 || nmatch == 3)
- {
- rlocator.spcOid = DEFAULTTABLESPACE_OID;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
- &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
- &segNo);
- if (nmatch == 3 || nmatch == 4)
- matched = true;
- }
- }
-
- /*
- * The sscanf tests above can match files that have extra characters at
- * the end. To eliminate such cases, cross-check that GetRelationPath
- * creates the exact same filename, when passed the RelFileLocator
- * information we extracted from the filename.
- */
- if (matched)
- {
- char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
-
- if (strcmp(check_path, path) != 0)
- matched = false;
-
- pfree(check_path);
- }
-
- return matched;
-}
-
/*
* A helper function to create the path of a relation file and segment.
*
@@ -693,11 +706,45 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target.
+ * Only WAL segment files after the last common segment number on
+ * the new source need to be copied to the new target.
+ *
+ * If the sizes are different copy again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
+ {
+ pg_log_debug("WAL segment \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -799,7 +846,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno, entry->source_size, entry->target_size);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
@@ -858,7 +916,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -868,7 +926,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..5145f0b4c46 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
@@ -106,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..318d0a27184 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +496,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
new file mode 100644
index 00000000000..d5fa28b1f49
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -0,0 +1,87 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Test situation where source and target data directory contains
+# the same WAL files
+#
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Spec;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+use Time::HiRes qw(usleep);
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+
+# advance WAL on primary; this WAL segment will be common between both
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Common segment to be skipped copying over
+my $wal_seg_skipped = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# last common checkpoint
+RewindTest::primary_psql("CHECKPOINT");
+
+RewindTest::promote_standby;
+
+my $wal_seg_copied = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get stat info for the WAL file that should be skipped
+my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_skipped);
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Store modification time for later comparison
+my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
+
+# Verify that the WAL segment on new timeline does not exist in target before rewind
+my $wal_copied_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_copied);
+my $wal_copied_stat = stat($wal_copied_path);
+ok(!defined($wal_copied_stat), "WAL segment $wal_seg_copied should not exist in target before rewind");
+
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Sleep to allow mtime to be different
+usleep(1000);
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//
+ ],
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/],
+ 'run pg_rewind'
+);
+
+# Verify that the copied WAL segment now exists in target
+$wal_copied_stat = stat($wal_copied_path);
+ok(defined($wal_copied_stat), "WAL segment $wal_seg_copied should exist in target after rewind");
+
+# Get current modification time of the skipped WAL segment
+my $wal_skipped_stat_after = stat($wal_skipped_path);
+defined($wal_skipped_stat_after) or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after->mtime;
+
+# 6. Validate that modification time hasn't changed,
+is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
+ "WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)");
+
+done_testing();
--
2.50.1
Hi,
just a second late :( i was about to post a patch addressing the refactors
which Robert mentioned ,anyway will have a look at your latest patch John
thanks :), curious about the tap test.
while i was writing the patch something suddenly struck me , that is why we
are even depending on last_common_segno ,because once we
reached decide_wal_file_action it means that the file exists in both target
and source ,AFAIK this can only happen with wal segments older than or
equal to last_common_segno because once the promotion competes the filename
of the WAL files gets changed with the new timelineID(2), for ex: if the
last_common_segno is 000000010000000000000003 then based on the rules
in XLogInitNewTimeline
1) if the timeline switch happens in middle of segment ,copy data from the
last WAL segment and create WAL file with same segno but different
timelineID,in this case the starting WAL file for the new timeline will be
000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next
segment for this case the starting WAL file for the new timeline will be
000000020000000000000004
so basically the files which exists in source and not in target like the
new timeline WAL segments will be copied to target in total before we reach
decide_wal_file_action , so i think we don't need to think about copying
WAL files after divergence point by calculating and checking against
last_common_segno which we are doing in our current approach , i think we
can just do
in decide_wal_file_action
if (entry->target_size != entry->source_size)
{
pg_log_debug("WAL segment file \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
return FILE_ACTION_NONE;
thoughts?
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hey Srinath
On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
...
1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
...
That's a great point. I need to think about it some more but the
reasoning makes sense to me.
I think 'last_common_segno ' is only useful as another sanity check
but we already have the size
ones.
Thanks,
--
John Hsu - Amazon Web Services
Hi John , I have reviewed your latest patch(v8-0001).
On Thu, Oct 9, 2025 at 11:27 PM John H <johnhyvr@gmail.com> wrote:
...
integrated. Let's rename the isRelDataFile() to getFileContentType()
and put all the logic in that function, including appropriate tests of
the path name. Second, in decide_file_action(), I think we should
check that entry->target_size == entry->source_size and return
FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
very cheap to check and might offer a bit of protection in some weird
scenario.Updated the patch to reflect that.
in getFileContentType as you moved the logic of validating
whether they are xlog files or not using IsXLogFileName ,
i think it's better to validate it the kind of similar way as its
done for relfiles by parsing the path using sscanf this will
help to get the tli,logno,segno of the xlog file and using
XLogFromFileName which gives us logSegNo ,so that we
can rebuild the xlog file path again using these values with
XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files like
pg_wal/00000001FFFFFFFFFFFFFF10.
maybe like this
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7;
+ XLogFromFileName(filename,&tli,logSegNo,WalSegSz);
+ XLogFilePath(check_path,tli,*logSegNo,WalSegSz);
+ if (strcmp(check_path, path) == 0)
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
I am not a huge fan of the way that the patch stuffs last_common_segno
into a global variable that is then accessed by
...I forgot why I bothered with a global instead since it was straightforward
to actually pass in the argument. Updated.
instead of passing last_common_segno down the call stack directly,
we can have a struct maybe called "rewindState" which will have the common
information related to both clusters involved in rewind ,so that in future
if we want to pass more such information down we won't be increasing
the number of arguments that need to be passed down to all functions in
stack.
It does not appear to me that the test case tests what it purports to
test, and I don't think that what it purports to test is the right
thing anyway. It claims that it is testing that a certain WAL file
(which it erroneously calls a "WAL file entry") is not copied to the
target, but I see no logic to actually check this.
...That's a fair point. The test does:
1. Write some data -> WAL 1
2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
3. CHECKPOINT; -> Makes WAL 2 the last common checkpointThen it checks that "WAL 1 not copied over" was logged which isn't
the best since if it was logged elsewhere then it would still pass.
we can improve the TAP test file header comment as
# Test the situation where we skip copying the same WAL files from source
to target
# except if WAL file size differs.
let me put it this way
1) WALs are written to 000000010000000000000002 in primary.
2) then SELECT pg_switch_wal() now the current WAL is
000000010000000000000003
3) CHECKPOINT; this is the last common checkpoint and is written in
000000010000000000000003.
4) until this point we have the same WAL files in primary and standby.
5) promote standby ,this leads to timeline switch and lets say the
divergence point
is at segment boundary , so the first WAL segment in the new timeline is
000000020000000000000004.
6) when we stop standby then primary , in primary we get checkpoint
shutdown which
is greater than the divergence point ,means there is divergence so
pg_rewind is needed.
7) so the last common WAL segment was 000000010000000000000003.
I have mentioned below about how this updated TAP test is testing whether
common WAL file is being copied or not and about one more test we need to
cover
in this TAP test.
...
first. If that's correct, I don't quite know what a test case here can
usefully verify, since the only difference would be performance, but
maybe I'm misunderstanding something?I updated the test to run stat and get the modification time of the common
WAL segment before and after pg_rewind and verify it is the same.In filemap.c in decide_wal_file_action, if you comment out
return FILE_ACTION_NONE;
the test fails.
This test makes sense to me, by checking whether
last common segment(000000010000000000000003 in our case)
been copied or not using the stat->mtime after the pg_rewind, if its not
copied then stat->mtime will be the same ,else it will differ with the
one we noted before rewind, but we also need to test the case
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi,
On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
...
XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
...
Are you concerned that somehow these files, which are named like XLog
files but actually
aren't, are somehow created therefore we should sync them in case?
I'm trying to understand how these files would be generated in the first place.
instead of passing last_common_segno down the call stack directly,
we can have a struct maybe called "rewindState" which will have the common
information related to both clusters involved in rewind ,so that in future
if we want to pass more such information down we won't be increasing
the number of arguments that need to be passed down to all functions in stack.
I don't feel too strongly about this. Not sure how we view
future-proofing things,
as in do we proactively wrap around structs or wait until there's an
actual need.
This is the first time it's been touched since it was introduced in 14 for what
it's worth.
we can improve the TAP test file header comment as
# Test the situation where we skip copying the same WAL files from source to target
# except if WAL file size differs.let me put it this way
1) WALs are written to 000000010000000000000002 in primary.
...
7) so the last common WAL segment was 000000010000000000000003.
Updated comments.
...
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.
Updated patch with one additional segment has a byte appended to it
on target but not source.
Thanks,
John H
Attachments:
v9-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patchapplication/octet-stream; name=v9-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patchDownload
From df7eb550c3ed8d42c76f91d547f68a32a08539ce Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Wed, 8 Oct 2025 21:20:59 +0000
Subject: [PATCH v9] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
src/bin/pg_rewind/filemap.c | 245 +++++++++++-------
src/bin/pg_rewind/filemap.h | 12 +-
src/bin/pg_rewind/pg_rewind.c | 6 +-
.../t/011_avoid_copying_common_wals.pl | 111 ++++++++
4 files changed, 277 insertions(+), 97 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..cd155eb2ec1 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -55,7 +55,6 @@
static filehash_hash *filehash;
-static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileLocator rlocator, ForkNumber forknum,
BlockNumber segno);
@@ -199,6 +198,106 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ RelFileLocator rlocator;
+ unsigned int segNo;
+ int nmatch;
+ bool matched;
+
+ /* Check if it's a WAL file */
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7; /* Skip "pg_wal/" */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
+
+ /*
+ * Does it look like a relation data file?
+ *
+ * For our purposes, only files belonging to the main fork are considered
+ * relation files. Other forks are always copied in toto, because we cannot
+ * reliably track changes to them, because WAL only contains block references
+ * for the main fork.
+ *
+ * ----
+ * Relation data files can be in one of the following directories:
+ *
+ * global/
+ * shared relations
+ *
+ * base/<db oid>/
+ * regular relations, default tablespace
+ *
+ * pg_tblspc/<tblspc oid>/<tblspc version>/
+ * within a non-default tablespace (the name of the directory
+ * depends on version)
+ *
+ * And the relation data files themselves have a filename like:
+ *
+ * <oid>.<segment number>
+ *
+ *----
+ */
+ rlocator.spcOid = InvalidOid;
+ rlocator.dbOid = InvalidOid;
+ rlocator.relNumber = InvalidRelFileNumber;
+ segNo = 0;
+ matched = false;
+
+ nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
+ if (nmatch == 1 || nmatch == 2)
+ {
+ rlocator.spcOid = GLOBALTABLESPACE_OID;
+ rlocator.dbOid = 0;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "base/%u/%u.%u",
+ &rlocator.dbOid, &rlocator.relNumber, &segNo);
+ if (nmatch == 2 || nmatch == 3)
+ {
+ rlocator.spcOid = DEFAULTTABLESPACE_OID;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+ &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
+ &segNo);
+ if (nmatch == 3 || nmatch == 4)
+ matched = true;
+ }
+ }
+
+ /*
+ * The sscanf tests above can match files that have extra characters at
+ * the end. To eliminate such cases, cross-check that GetRelationPath
+ * creates the exact same filename, when passed the RelFileLocator
+ * information we extracted from the filename.
+ */
+ if (matched)
+ {
+ char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
+
+ if (strcmp(check_path, path) != 0)
+ matched = false;
+
+ pfree(check_path);
+ }
+
+ if (matched)
+ return FILE_CONTENT_TYPE_RELATION;
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +309,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -294,7 +393,7 @@ process_source_file(const char *path, file_type_t type, size_t size,
* sanity check: a filename that looks like a data file better be a
* regular file
*/
- if (type != FILE_TYPE_REGULAR && isRelDataFile(path))
+ if (type != FILE_TYPE_REGULAR && getFileContentType(path) == FILE_CONTENT_TYPE_RELATION)
pg_fatal("data file \"%s\" in source is not a regular file", path);
/* Remember this source file */
@@ -383,7 +482,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -559,92 +658,6 @@ print_filemap(filemap_t *filemap)
fflush(stdout);
}
-/*
- * Does it look like a relation data file?
- *
- * For our purposes, only files belonging to the main fork are considered
- * relation files. Other forks are always copied in toto, because we cannot
- * reliably track changes to them, because WAL only contains block references
- * for the main fork.
- */
-static bool
-isRelDataFile(const char *path)
-{
- RelFileLocator rlocator;
- unsigned int segNo;
- int nmatch;
- bool matched;
-
- /*----
- * Relation data files can be in one of the following directories:
- *
- * global/
- * shared relations
- *
- * base/<db oid>/
- * regular relations, default tablespace
- *
- * pg_tblspc/<tblspc oid>/<tblspc version>/
- * within a non-default tablespace (the name of the directory
- * depends on version)
- *
- * And the relation data files themselves have a filename like:
- *
- * <oid>.<segment number>
- *
- *----
- */
- rlocator.spcOid = InvalidOid;
- rlocator.dbOid = InvalidOid;
- rlocator.relNumber = InvalidRelFileNumber;
- segNo = 0;
- matched = false;
-
- nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
- if (nmatch == 1 || nmatch == 2)
- {
- rlocator.spcOid = GLOBALTABLESPACE_OID;
- rlocator.dbOid = 0;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "base/%u/%u.%u",
- &rlocator.dbOid, &rlocator.relNumber, &segNo);
- if (nmatch == 2 || nmatch == 3)
- {
- rlocator.spcOid = DEFAULTTABLESPACE_OID;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
- &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
- &segNo);
- if (nmatch == 3 || nmatch == 4)
- matched = true;
- }
- }
-
- /*
- * The sscanf tests above can match files that have extra characters at
- * the end. To eliminate such cases, cross-check that GetRelationPath
- * creates the exact same filename, when passed the RelFileLocator
- * information we extracted from the filename.
- */
- if (matched)
- {
- char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
-
- if (strcmp(check_path, path) != 0)
- matched = false;
-
- pfree(check_path);
- }
-
- return matched;
-}
-
/*
* A helper function to create the path of a relation file and segment.
*
@@ -693,11 +706,44 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target which means
+ * they should already be identical and before the last_common_segno.
+ *
+ * However we check last_common_segno and file_size again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
+ {
+ pg_log_debug("WAL segment \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -799,7 +845,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno, entry->source_size, entry->target_size);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
@@ -858,7 +915,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -868,7 +925,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..5145f0b4c46 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
@@ -106,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..318d0a27184 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +496,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
new file mode 100644
index 00000000000..8d49ebde04d
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Test situation where we skip copying the same WAL files from source
+# to target unless file size differs.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Spec;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+use Time::HiRes qw(usleep);
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+
+
+# Advance WAL on primary
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Common segment to be skipped copying over as it's common and same size
+my $wal_seg_skipped = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# Segment before divergerec that will be copied over because file size differs
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+my $wal_seg_before_diverge_copied = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# last common checkpoint
+RewindTest::primary_psql("CHECKPOINT");
+
+RewindTest::promote_standby;
+
+# Segment after divergerec, expected to be copied
+my $wal_seg_copied = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get stat info for the WAL file that should be skipped
+my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_skipped);
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Store modification time for later comparison
+my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
+
+# Append an additional byte to segment on target before divergerec to test it's copied over from source
+my $wal_seg_before_diverge_copied_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_before_diverge_copied);
+open my $fh, ">>", $wal_seg_before_diverge_copied_path
+ or die "could not open $wal_seg_before_diverge_copied_path";
+
+print $fh 'a';
+close $fh;
+
+my $wal_seg_existing_copied_stat = stat($wal_seg_before_diverge_copied_path);
+ok(defined($wal_seg_existing_copied_stat), "WAL segment $wal_seg_before_diverge_copied should exist in target before rewind");
+
+# Verify that the WAL segment on new timeline does not exist in target before rewind
+my $wal_copied_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_copied);
+my $wal_copied_stat = stat($wal_copied_path);
+ok(!defined($wal_copied_stat), "WAL segment $wal_seg_copied should not exist in target before rewind");
+
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Sleep to allow mtime to be different
+usleep(1000);
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//
+ ],
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/],
+ 'run pg_rewind'
+);
+
+# Verify that the copied WAL segment now exists in target
+$wal_copied_stat = stat($wal_copied_path);
+ok(defined($wal_copied_stat), "WAL segment $wal_seg_copied should exist in target after rewind");
+
+# Get current modification time of the skipped WAL segment
+my $wal_skipped_stat_after = stat($wal_skipped_path);
+defined($wal_skipped_stat_after) or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after->mtime;
+
+# Validate that modification time hasn't changed,
+is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
+ "WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)");
+
+# Validate WAL with size different prior to divergerec was updated
+my $wal_seg_existing_copied_after_stat = stat($wal_seg_before_diverge_copied_path);
+defined($wal_seg_existing_copied_after_stat) or die("unable to stat $wal_skipped_path");
+
+isnt($wal_seg_existing_copied_stat->size, $wal_seg_existing_copied_after_stat->size, "WAL segment $wal_seg_before_diverge_copied with byte difference did not get copied");
+
+
+done_testing();
--
2.50.1
Hi,
On Wed, Oct 15, 2025 at 4:49 AM John H <johnhyvr@gmail.com> wrote:
Hi,
On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:...
XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files likepg_wal/00000001FFFFFFFFFFFFFF10.
...
Are you concerned that somehow these files, which are named like XLog
files but actually
aren't, are somehow created therefore we should sync them in case?
I'm trying to understand how these files would be generated in the first
place.
the problem is not when server generates them because
filename is properly calculated using the XLogRecPtr in
XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.
instead of passing last_common_segno down the call stack directly,
we can have a struct maybe called "rewindState" which will have thecommon
information related to both clusters involved in rewind ,so that in
future
if we want to pass more such information down we won't be increasing
the number of arguments that need to be passed down to all functions instack.
I don't feel too strongly about this. Not sure how we view
future-proofing things,
as in do we proactively wrap around structs or wait until there's an
actual need.This is the first time it's been touched since it was introduced in 14 for
what
it's worth.
TBH first it seemed to me a good coding practice for future proofing,
then i checked if there's any place in postgres we are doing
something similar ,then found 1 in src/include/access/gist.h
typedef struct GISTDeletedPageContents
{
/* last xid which could see the page in a scan */
FullTransactionId deleteXid;
} GISTDeletedPageContents;
we can improve the TAP test file header comment as
# Test the situation where we skip copying the same WAL files fromsource to target
# except if WAL file size differs.
let me put it this way
1) WALs are written to 000000010000000000000002 in primary.
...
7) so the last common WAL segment was 000000010000000000000003.Updated comments.
...
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.Updated patch with one additional segment has a byte appended to it
on target but not source.
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
to improve the validation that we copy the WAL file even
when it exists on both source and target but the size differs.
I also tried to rename some variables for readability,while
testing this out i found that usleep(1000); is not enough
to show that there has been a copy which changed the
stat->mtime of the file because i think the copy was very
fast (less than 1 millisecond) so the mtime of the
file after rewind's copy didn't change or filesystem precision
didn't caught the change, the copying of the file is confirmed
because the same file has different size before rewind and
same after rewind which these tests proved
ok($corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
different size in source vs target before rewind");
ok 11 - WAL segment 000000010000000000000004 has different size in source
vs target before rewind
ok($corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
copied: file sizes are same between target and source after rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
same between target and source after rewind
and more over the pg_rewind confirmed this with a new debug
message which I added in this diff patch
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 69d7728ce44..743484f4833 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
pg_log_debug("WAL segment \"%s\" not copied to target", fname);
return FILE_ACTION_NONE;
}
+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
is copied to target)/
also did this instead of using catfile
-my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir,
'pg_wal', $wal_seg_skipped);
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' .
$wal_seg_skipped;
seems consistent with other TAP tests.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Attachments:
v9-0001.diffapplication/octet-stream; name=v9-0001.diffDownload
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 69d7728ce44..743484f4833 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
pg_log_debug("WAL segment \"%s\" not copied to target", fname);
return FILE_ACTION_NONE;
}
+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
index 8d49ebde04d..bcad03a50c5 100644
--- a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -7,7 +7,6 @@ use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Utils;
use Test::More;
-use File::Spec;
use File::stat qw(stat);
use FindBin;
@@ -32,7 +31,7 @@ RewindTest::primary_psql("SELECT pg_switch_wal()");
# Segment before divergerec that will be copied over because file size differs
RewindTest::primary_psql("INSERT INTO t VALUES(0)");
-my $wal_seg_before_diverge_copied = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+my $corrupt_wal_seg = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
RewindTest::primary_psql("SELECT pg_switch_wal()");
# last common checkpoint
@@ -40,39 +39,40 @@ RewindTest::primary_psql("CHECKPOINT");
RewindTest::promote_standby;
-# Segment after divergerec, expected to be copied
-my $wal_seg_copied = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+# New Timeline Segment after divergerec, expected to be copied
+my $new_timeline_wal_seg = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
# Get stat info for the WAL file that should be skipped
-my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_skipped);
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
my $wal_skipped_stat = stat($wal_skipped_path);
defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
# Store modification time for later comparison
my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
-# Append an additional byte to segment on target before divergerec to test it's copied over from source
-my $wal_seg_before_diverge_copied_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_before_diverge_copied);
-open my $fh, ">>", $wal_seg_before_diverge_copied_path
- or die "could not open $wal_seg_before_diverge_copied_path";
+# Corrupt a segment on target before divergerec to test it's copied over from source,
+# by appending an additional byte to it.
+my $corrupt_wal_seg_in_target_path = $node_primary->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+open my $fh, ">>", $corrupt_wal_seg_in_target_path
+ or die "could not open $corrupt_wal_seg_in_target_path";
print $fh 'a';
close $fh;
-my $wal_seg_existing_copied_stat = stat($wal_seg_before_diverge_copied_path);
-ok(defined($wal_seg_existing_copied_stat), "WAL segment $wal_seg_before_diverge_copied should exist in target before rewind");
+my $corrupt_wal_seg_stat_before_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_before_rewind), "WAL segment $corrupt_wal_seg should exist in target before rewind");
# Verify that the WAL segment on new timeline does not exist in target before rewind
-my $wal_copied_path = File::Spec->catfile($node_primary->data_dir, 'pg_wal', $wal_seg_copied);
-my $wal_copied_stat = stat($wal_copied_path);
-ok(!defined($wal_copied_stat), "WAL segment $wal_seg_copied should not exist in target before rewind");
+my $new_timeline_wal_seg_path = $node_primary->data_dir . '/pg_wal/' . $new_timeline_wal_seg;
+my $new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(!defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should not exist in target before rewind");
$node_standby->stop();
$node_primary->stop();
# Sleep to allow mtime to be different
-usleep(1000);
+usleep(1000000);
command_checks_all(
[
@@ -84,28 +84,37 @@ command_checks_all(
0,
[qr//
],
- [qr/WAL segment \"$wal_seg_skipped\" not copied to target/],
+ [
+ qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/WAL segment \"$corrupt_wal_seg\" is copied to target/
+ ],
'run pg_rewind'
);
# Verify that the copied WAL segment now exists in target
-$wal_copied_stat = stat($wal_copied_path);
-ok(defined($wal_copied_stat), "WAL segment $wal_seg_copied should exist in target after rewind");
+$new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should exist in target after rewind");
# Get current modification time of the skipped WAL segment
-my $wal_skipped_stat_after = stat($wal_skipped_path);
-defined($wal_skipped_stat_after) or die("unable to stat $wal_skipped_path after rewind");
-my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after->mtime;
+my $wal_skipped_stat_after_rewind = stat($wal_skipped_path);
+defined($wal_skipped_stat_after_rewind) or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after_rewind->mtime;
# Validate that modification time hasn't changed,
is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
"WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)");
-# Validate WAL with size different prior to divergerec was updated
-my $wal_seg_existing_copied_after_stat = stat($wal_seg_before_diverge_copied_path);
-defined($wal_seg_existing_copied_after_stat) or die("unable to stat $wal_skipped_path");
-
-isnt($wal_seg_existing_copied_stat->size, $wal_seg_existing_copied_after_stat->size, "WAL segment $wal_seg_before_diverge_copied with byte difference did not get copied");
+# Validate that the WAL segment with same filename as the
+# corrupt WAL segment in target has been copied from source
+# where we have it intact.
+my $corrupt_wal_seg_in_source_path = $node_standby->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+my $corrupt_wal_seg_source_stat = stat($corrupt_wal_seg_in_source_path);
+ok(defined($corrupt_wal_seg_source_stat), "WAL segment $corrupt_wal_seg should exist in source after rewind");
+my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_after_rewind), "WAL segment $corrupt_wal_seg should exist in target after rewind");
+ok($corrupt_wal_seg_stat_before_rewind->size != $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has different size in source vs target before rewind");
+ok($corrupt_wal_seg_stat_after_rewind->mtime > $corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg was copied: mtime after rewind > mtime before rewind");
+ok($corrupt_wal_seg_stat_after_rewind->size == $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was copied: file sizes are same between target and source after rewind");
done_testing();
On Wed, Oct 15, 2025 at 7:56 PM Srinath Reddy Sadipiralla <
srinath2133@gmail.com> wrote:
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
to improve the validation that we copy the WAL file even
when it exists on both source and target but the size differs.
I also tried to rename some variables for readability,while
testing this out i found that usleep(1000); is not enough
to show that there has been a copy which changed the
stat->mtime of the file because i think the copy was very
fast (less than 1 millisecond) so the mtime of the
file after rewind's copy didn't change or filesystem precision
didn't caught the change, the copying of the file is confirmed
because the same file has different size before rewind and
same after rewind which these tests proved
ok($corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
different size in source vs target before rewind");
ok 11 - WAL segment 000000010000000000000004 has different size in source
vs target before rewindok($corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
copied: file sizes are same between target and source after rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
same between target and source after rewindand more over the pg_rewind confirmed this with a new debug
message which I added in this diff patchdiff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index 69d7728ce44..743484f4833 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo last_common_segno, */ if (file_segno < last_common_segno && source_size == target_size) { pg_log_debug("WAL segment \"%s\" not copied to target", fname); return FILE_ACTION_NONE; }+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
is copied to target)/
btw i forgot to let you know that i have increased the
usleep(1000) to usleep(1000000) for the above reason
which helped to show the difference in mtime after rewind
adding the below extra validation that the file has been copied from
source to target ,this is already done in the v9-0001.diff patch.
ok($corrupt_wal_seg_stat_after_rewind->mtime >
$corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg
was copied: mtime after rewind > mtime before rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: mtime after
rewind > mtime before rewind
On Fri, Oct 10, 2025 at 3:18 AM John H <johnhyvr@gmail.com> wrote:
Hey Srinath
On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:...
1) if the timeline switch happens in middle of segment ,copy data fromthe last WAL segment and create WAL file with same segno but different
timelineID,in this case the starting WAL file for the new timeline will be
0000000200000000000000032) if the timeline switch happens at segment boundary , just create next
segment for this case the starting WAL file for the new timeline will be
000000020000000000000004so basically the files which exists in source and not in target like the
new timeline WAL segments will be copied to target in total before we reach
decide_wal_file_action , so i think we don't need to think about copying
WAL files after divergence point by calculating and checking against
last_common_segno which we are doing in our current approach , i think we
can just do...
That's a great point. I need to think about it some more but the
reasoning makes sense to me.
I think 'last_common_segno ' is only useful as another sanity check
but we already have the size
ones.
it would be awesome to know ,if you have any update on this.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
just a second late :( i was about to post a patch addressing the refactors which Robert mentioned ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
What makes me nervous about this is that it isn't necessarily the case
that the servers were perfectly in sync at the time of the failure.
Suppose that the primary was in the middle of writing
000000010000000000000003. The standby might also have this file, but
it might contain less valid data than the one on the primary;
therefore, if we don't copy the file, the two servers might not have
an identical file. Maybe that wouldn't really matter, in the sense
that the extra valid data that exists on the original primary
shouldn't prevent it from replaying WAL on the new primary's timeline,
which is probably all we really care about. But it feels dangerous to
me.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
...Are you concerned that somehow these files, which are named like XLog
files but actually
aren't, are somehow created therefore we should sync them in case?
I'm trying to understand how these files would be generated in the first place.the problem is not when server generates them because
filename is properly calculated using the XLogRecPtr in
XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.
I think that parsing the file name may be a good idea so that we can
do appropriate sanity checks on the values (e.g. checking that we're
only skipping copying prior to last_common_segno), but I do not think
we should worry too much about the user manually injecting invalid WAL
files. I mean, I would prefer that if that does happen, it either
works anyway or fails with a sensible error message, rather than
emitting an incomprehensible error message or dumping core. But, it is
in general true that if manual modifications are made to the data
directory, things may go terribly wrong, and this code is not obliged
to provide any more protection against such scenarios than we do in
other cases. Ultimately, such modifications are user error.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
TBH first it seemed to me a good coding practice for future proofing,
then i checked if there's any place in postgres we are doing
something similar ,then found 1 in src/include/access/gist.htypedef struct GISTDeletedPageContents
{
/* last xid which could see the page in a scan */
FullTransactionId deleteXid;
} GISTDeletedPageContents;
It makes more sense to me if we expect the struct or same set of arguments
to be reused in different places / want a stable API reference. I don't think
we want everything to be wrapped around a struct for functions just in case.
..
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
...
Thank you for the diff! Your changes are a lot cleaner and I've included it in
the latest patch. One difference I've made is instead of additionally logging in
decide_wal_file_action
+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
I realized we are already logging the WAL segments that are copied over in
print_filemap so I've updated the test to check for that instead
qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
I also updated the error message for the last three checks.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
v10-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchapplication/octet-stream; name=v10-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchDownload
From f3e6c164368ae8efdac57201ba5009d7656f5be2 Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Wed, 8 Oct 2025 21:20:59 +0000
Subject: [PATCH v10] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
src/bin/pg_rewind/filemap.c | 245 +++++++++++-------
src/bin/pg_rewind/filemap.h | 12 +-
src/bin/pg_rewind/pg_rewind.c | 6 +-
.../t/011_avoid_copying_common_wals.pl | 120 +++++++++
4 files changed, 286 insertions(+), 97 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..cd155eb2ec1 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -55,7 +55,6 @@
static filehash_hash *filehash;
-static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileLocator rlocator, ForkNumber forknum,
BlockNumber segno);
@@ -199,6 +198,106 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ RelFileLocator rlocator;
+ unsigned int segNo;
+ int nmatch;
+ bool matched;
+
+ /* Check if it's a WAL file */
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7; /* Skip "pg_wal/" */
+
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
+
+ /*
+ * Does it look like a relation data file?
+ *
+ * For our purposes, only files belonging to the main fork are considered
+ * relation files. Other forks are always copied in toto, because we cannot
+ * reliably track changes to them, because WAL only contains block references
+ * for the main fork.
+ *
+ * ----
+ * Relation data files can be in one of the following directories:
+ *
+ * global/
+ * shared relations
+ *
+ * base/<db oid>/
+ * regular relations, default tablespace
+ *
+ * pg_tblspc/<tblspc oid>/<tblspc version>/
+ * within a non-default tablespace (the name of the directory
+ * depends on version)
+ *
+ * And the relation data files themselves have a filename like:
+ *
+ * <oid>.<segment number>
+ *
+ *----
+ */
+ rlocator.spcOid = InvalidOid;
+ rlocator.dbOid = InvalidOid;
+ rlocator.relNumber = InvalidRelFileNumber;
+ segNo = 0;
+ matched = false;
+
+ nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
+ if (nmatch == 1 || nmatch == 2)
+ {
+ rlocator.spcOid = GLOBALTABLESPACE_OID;
+ rlocator.dbOid = 0;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "base/%u/%u.%u",
+ &rlocator.dbOid, &rlocator.relNumber, &segNo);
+ if (nmatch == 2 || nmatch == 3)
+ {
+ rlocator.spcOid = DEFAULTTABLESPACE_OID;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+ &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
+ &segNo);
+ if (nmatch == 3 || nmatch == 4)
+ matched = true;
+ }
+ }
+
+ /*
+ * The sscanf tests above can match files that have extra characters at
+ * the end. To eliminate such cases, cross-check that GetRelationPath
+ * creates the exact same filename, when passed the RelFileLocator
+ * information we extracted from the filename.
+ */
+ if (matched)
+ {
+ char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
+
+ if (strcmp(check_path, path) != 0)
+ matched = false;
+
+ pfree(check_path);
+ }
+
+ if (matched)
+ return FILE_CONTENT_TYPE_RELATION;
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +309,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -294,7 +393,7 @@ process_source_file(const char *path, file_type_t type, size_t size,
* sanity check: a filename that looks like a data file better be a
* regular file
*/
- if (type != FILE_TYPE_REGULAR && isRelDataFile(path))
+ if (type != FILE_TYPE_REGULAR && getFileContentType(path) == FILE_CONTENT_TYPE_RELATION)
pg_fatal("data file \"%s\" in source is not a regular file", path);
/* Remember this source file */
@@ -383,7 +482,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -559,92 +658,6 @@ print_filemap(filemap_t *filemap)
fflush(stdout);
}
-/*
- * Does it look like a relation data file?
- *
- * For our purposes, only files belonging to the main fork are considered
- * relation files. Other forks are always copied in toto, because we cannot
- * reliably track changes to them, because WAL only contains block references
- * for the main fork.
- */
-static bool
-isRelDataFile(const char *path)
-{
- RelFileLocator rlocator;
- unsigned int segNo;
- int nmatch;
- bool matched;
-
- /*----
- * Relation data files can be in one of the following directories:
- *
- * global/
- * shared relations
- *
- * base/<db oid>/
- * regular relations, default tablespace
- *
- * pg_tblspc/<tblspc oid>/<tblspc version>/
- * within a non-default tablespace (the name of the directory
- * depends on version)
- *
- * And the relation data files themselves have a filename like:
- *
- * <oid>.<segment number>
- *
- *----
- */
- rlocator.spcOid = InvalidOid;
- rlocator.dbOid = InvalidOid;
- rlocator.relNumber = InvalidRelFileNumber;
- segNo = 0;
- matched = false;
-
- nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
- if (nmatch == 1 || nmatch == 2)
- {
- rlocator.spcOid = GLOBALTABLESPACE_OID;
- rlocator.dbOid = 0;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "base/%u/%u.%u",
- &rlocator.dbOid, &rlocator.relNumber, &segNo);
- if (nmatch == 2 || nmatch == 3)
- {
- rlocator.spcOid = DEFAULTTABLESPACE_OID;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
- &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
- &segNo);
- if (nmatch == 3 || nmatch == 4)
- matched = true;
- }
- }
-
- /*
- * The sscanf tests above can match files that have extra characters at
- * the end. To eliminate such cases, cross-check that GetRelationPath
- * creates the exact same filename, when passed the RelFileLocator
- * information we extracted from the filename.
- */
- if (matched)
- {
- char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
-
- if (strcmp(check_path, path) != 0)
- matched = false;
-
- pfree(check_path);
- }
-
- return matched;
-}
-
/*
* A helper function to create the path of a relation file and segment.
*
@@ -693,11 +706,44 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target which means
+ * they should already be identical and before the last_common_segno.
+ *
+ * However we check last_common_segno and file_size again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
+ {
+ pg_log_debug("WAL segment \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -799,7 +845,18 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno, entry->source_size, entry->target_size);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
@@ -858,7 +915,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -868,7 +925,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..5145f0b4c46 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
@@ -106,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..318d0a27184 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +496,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
new file mode 100644
index 00000000000..8c23efbf1cb
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -0,0 +1,120 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Test situation where we skip copying the same WAL files from source
+# to target unless file size differs.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+use Time::HiRes qw(usleep);
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+
+
+# Advance WAL on primary
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Common segment to be skipped copying over as it's common and same size
+my $wal_seg_skipped = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# Segment before divergerec that will be copied over because file size differs
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+my $corrupt_wal_seg = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# last common checkpoint
+RewindTest::primary_psql("CHECKPOINT");
+
+RewindTest::promote_standby;
+
+# New Timeline Segment after divergerec, expected to be copied
+my $new_timeline_wal_seg = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get stat info for the WAL file that should be skipped
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Store modification time for later comparison
+my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
+
+# Corrupt a segment on target before divergerec to test it's copied over from source,
+# by appending an additional byte to it.
+my $corrupt_wal_seg_in_target_path = $node_primary->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+open my $fh, ">>", $corrupt_wal_seg_in_target_path
+ or die "could not open $corrupt_wal_seg_in_target_path";
+
+print $fh 'a';
+close $fh;
+
+my $corrupt_wal_seg_stat_before_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_before_rewind), "WAL segment $corrupt_wal_seg should exist in target before rewind");
+
+# Verify that the WAL segment on new timeline does not exist in target before rewind
+my $new_timeline_wal_seg_path = $node_primary->data_dir . '/pg_wal/' . $new_timeline_wal_seg;
+my $new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(!defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should not exist in target before rewind");
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Sleep to allow mtime to be different
+usleep(1000000);
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//
+ ],
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+ ],
+ 'run pg_rewind'
+);
+
+# Verify that the copied WAL segment now exists in target
+$new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should exist in target after rewind");
+
+# Get current modification time of the skipped WAL segment
+my $wal_skipped_stat_after_rewind = stat($wal_skipped_path);
+defined($wal_skipped_stat_after_rewind) or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after_rewind->mtime;
+
+# Validate that modification time hasn't changed,
+is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
+ "WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)");
+
+# Validate that the WAL segment with same filename as the
+# corrupt WAL segment in target has been copied from source
+# where we have it intact.
+my $corrupt_wal_seg_in_source_path = $node_standby->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+my $corrupt_wal_seg_source_stat = stat($corrupt_wal_seg_in_source_path);
+ok(defined($corrupt_wal_seg_source_stat), "WAL segment $corrupt_wal_seg should exist in source after rewind");
+my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_after_rewind), "WAL segment $corrupt_wal_seg should exist in target after rewind");
+ok($corrupt_wal_seg_stat_before_rewind->size != $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg to have different size in source vs target before rewind");
+ok($corrupt_wal_seg_stat_after_rewind->mtime > $corrupt_wal_seg_stat_before_rewind->mtime,
+ "Expected WAL segment $corrupt_wal_seg to have greater mtime on target than source after rewind");
+ok($corrupt_wal_seg_stat_after_rewind->size == $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same between target and source after rewind");
+
+done_testing();
--
2.50.1
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.I think that parsing the file name may be a good idea so that we can
do appropriate sanity checks on the values (e.g. checking that we're
only skipping copying prior to last_common_segno), but I do not think
we should worry too much about the user manually injecting invalid WAL
files. I mean, I would prefer that if that does happen, it either
works anyway or fails with a sensible error message, rather than
emitting an incomprehensible error message or dumping core. But, it is
in general true that if manual modifications are made to the data
directory, things may go terribly wrong, and this code is not obliged
to provide any more protection against such scenarios than we do in
other cases. Ultimately, such modifications are user error.
It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.
For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
--
John Hsu - Amazon Web Services
Thanks John, great to see this patch getting attention again.
This builds on my original version from 2022, glad to see it moving forward and happy with the direction it’s taking.
—Justin
________________________________
From: John H <johnhyvr@gmail.com>
Sent: October 16, 2025 6:55 PM
To: Robert Haas <robertmhaas@gmail.com>
Cc: Srinath Reddy Sadipiralla <srinath2133@gmail.com>; wenhui qiu <qiuwenhuifx@gmail.com>; Japin Li <japinli@hotmail.com>; Michael Paquier <michael@paquier.xyz>; Andres Freund <andres@anarazel.de>; Alexander Korotkov <aekorotkov@gmail.com>; Justin Kwan <justinpkwan@outlook.com>; Tom Lane <tgl@sss.pgh.pa.us>; pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>; jkwan@cloudflare.com <jkwan@cloudflare.com>
Subject: Re: Making pg_rewind faster
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.I think that parsing the file name may be a good idea so that we can
do appropriate sanity checks on the values (e.g. checking that we're
only skipping copying prior to last_common_segno), but I do not think
we should worry too much about the user manually injecting invalid WAL
files. I mean, I would prefer that if that does happen, it either
works anyway or fails with a sensible error message, rather than
emitting an incomprehensible error message or dumping core. But, it is
in general true that if manual modifications are made to the data
directory, things may go terribly wrong, and this code is not obliged
to provide any more protection against such scenarios than we do in
other cases. Ultimately, such modifications are user error.
It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.
For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
--
John Hsu - Amazon Web Services
On Thu, Oct 16, 2025 at 11:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:just a second late :( i was about to post a patch addressing the
refactors which Robert mentioned ,anyway will have a look at your latest
patch John thanks :), curious about the tap test.while i was writing the patch something suddenly struck me , that is why
we are even depending on last_common_segno ,because once we reached
decide_wal_file_action it means that the file exists in both target and
source ,AFAIK this can only happen with wal segments older than or equal to
last_common_segno because once the promotion competes the filename of the
WAL files gets changed with the new timelineID(2), for ex: if the
last_common_segno is 000000010000000000000003 then based on the rules in
XLogInitNewTimeline1) if the timeline switch happens in middle of segment ,copy data from
the last WAL segment and create WAL file with same segno but different
timelineID,in this case the starting WAL file for the new timeline will be
0000000200000000000000032) if the timeline switch happens at segment boundary , just create next
segment for this case the starting WAL file for the new timeline will be
000000020000000000000004so basically the files which exists in source and not in target like the
new timeline WAL segments will be copied to target in total before we reach
decide_wal_file_action , so i think we don't need to think about copying
WAL files after divergence point by calculating and checking against
last_common_segno which we are doing in our current approach , i think we
can just doWhat makes me nervous about this is that it isn't necessarily the case
that the servers were perfectly in sync at the time of the failure.
Suppose that the primary was in the middle of writing
000000010000000000000003. The standby might also have this file, but
it might contain less valid data than the one on the primary;
therefore, if we don't copy the file, the two servers might not have
an identical file. Maybe that wouldn't really matter, in the sense
that the extra valid data that exists on the original primary
shouldn't prevent it from replaying WAL on the new primary's timeline,
which is probably all we really care about. But it feels dangerous to
me.
Thanks Robert ,I want to understand this point more , and will get back .
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi,
On Fri, Oct 17, 2025 at 4:22 AM John H <johnhyvr@gmail.com> wrote:
Hi,
On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:TBH first it seemed to me a good coding practice for future proofing,
then i checked if there's any place in postgres we are doing
something similar ,then found 1 in src/include/access/gist.htypedef struct GISTDeletedPageContents
{
/* last xid which could see the page in a scan */
FullTransactionId deleteXid;
} GISTDeletedPageContents;It makes more sense to me if we expect the struct or same set of arguments
to be reused in different places / want a stable API reference. I don't
think
we want everything to be wrapped around a struct for functions just in
case...
thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
...Thank you for the diff! Your changes are a lot cleaner and I've included
it in
the latest patch. One difference I've made is instead of additionally
logging in
decide_wal_file_action+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
I realized we are already logging the WAL segments that are copied over in
print_filemap so I've updated the test to check for that insteadqr/pg_wal\/$corrupt_wal_seg \(COPY\)/
I also updated the error message for the last three checks.
Thanks for updating the patch, I have reviewed it,
except these below concerns, patch LGTM.
1) regarding the parsing the WAL filename
On Fri, Oct 17, 2025 at 4:25 AM John H <johnhyvr@gmail.com> wrote:
On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com>
wrote:On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
,the main problem is when if someone manually places an invalid WAL
file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier wecan
catch such cases.
I think that parsing the file name may be a good idea so that we can
do appropriate sanity checks on the values (e.g. checking that we're
only skipping copying prior to last_common_segno), but I do not think
we should worry too much about the user manually injecting invalid WAL
files. I mean, I would prefer that if that does happen, it either
works anyway or fails with a sensible error message, rather than
emitting an incomprehensible error message or dumping core. But, it is
in general true that if manual modifications are made to the data
directory, things may go terribly wrong, and this code is not obliged
to provide any more protection against such scenarios than we do in
other cases. Ultimately, such modifications are user error.It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually
puts
or modifies content unexpectedly in the pg_wal directory.For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.
I agree with Robert here, and i think if we found
that the WAL filename is not valid after parsing
we can return something like FILE_CONTENT_INVALID_TYPE
in getFileContentType so that it won't end up going
through decide_wal_file_action,then XLogFromFileName
may assign file_segno invalid values greater than
last_common_segno which means we lead to copying it,even
though it's an invalid file, thoughts?
2) Please fix the indentation by running pgindent,
also add file_content_type_t typedef to typedefs.list
file.
3) maybe we can improve the error messages for the
last 2 checks in tap test, that because of this
reason hence proven that the copy from source to
target has been done.
4) We need to update the pg_rewind docs regarding
this new behaviour.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi Srinath,
Thank you for the quick review!
On Sat, Oct 18, 2025 at 9:05 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
...
I agree with Robert here, and i think if we found
that the WAL filename is not valid after parsing
we can return something like FILE_CONTENT_INVALID_TYPE
in getFileContentType so that it won't end up going
through decide_wal_file_action,then XLogFromFileName
may assign file_segno invalid values greater than
last_common_segno which means we lead to copying it,even
though it's an invalid file, thoughts?
My understanding of pg_rewind is that it aims to make the two clusters
"identical"
after rewind, similar to a basebackup. It does not try to make decisions
on whether or not a file is necessary for database to start. I'd
prefer the patch aims
to keep parity with the existing behavior of pg_rewind/pg_basebackup as much as
possible. If there is an unexpected/invalid segment and we can't make
assumptions
about the content, we should just sync it to keep with existing behavior.
There's lots of potential optimizations pg_rewind can do if we only
talk about what is
"strictly" necessary for the database to start. We can exclude any
directories that
should not be there. For example, if there exists a directory
$PGDATA/random_files
on source and target, the whole directory gets synced right now. We could not
include that directory and the database would most likely still start
but that goes
against the current way pg_rewind works.
2) Please fix the indentation by running pgindent,
also add file_content_type_t typedef to typedefs.list
file.
Will update.
3) maybe we can improve the error messages for the
last 2 checks in tap test, that because of this
reason hence proven that the copy from source to
target has been done.
What are you thinking of? Something like the below?
+ "Expected WAL segment $corrupt_wal_seg to have been modified
due to rewind");
+ "Expected WAL segment $corrupt_wal_seg to have been synced
from source to target after rewind");
4) We need to update the pg_rewind docs regarding
this new behaviour.
Will update.
--
John Hsu - Amazon Web Services
Hi John,
On Tue, Oct 21, 2025 at 3:34 AM John H <johnhyvr@gmail.com> wrote:
3) maybe we can improve the error messages for the
last 2 checks in tap test, that because of this
reason hence proven that the copy from source to
target has been done.What are you thinking of? Something like the below?
+ "Expected WAL segment $corrupt_wal_seg to have been modified due to rewind"); + "Expected WAL segment $corrupt_wal_seg to have been synced from source to target after rewind");
maybe something like this
+ "Expected WAL segment $corrupt_wal_seg to have later mtime
on target than source after rewind as it was copied");
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same
between target and source after rewind as it was copied");
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi Srinath,
On Mon, Oct 20, 2025 at 9:36 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
... maybe something like this + "Expected WAL segment $corrupt_wal_seg to have later mtime on target than source after rewind as it was copied"); + "Expected WAL segment $corrupt_wal_seg file sizes to be same between target and source after rewind as it was copied");
Made the changes and included the documentation update.
Thanks,
--
John Hsu - Amazon Web Services
Attachments:
v11-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchapplication/octet-stream; name=v11-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchDownload
From 0896b78e4ed60c8b57481783d06c7522f5672692 Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Wed, 8 Oct 2025 21:20:59 +0000
Subject: [PATCH v11] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
doc/src/sgml/ref/pg_rewind.sgml | 30 ++-
src/bin/pg_rewind/filemap.c | 243 +++++++++++-------
src/bin/pg_rewind/filemap.h | 12 +-
src/bin/pg_rewind/pg_rewind.c | 6 +-
.../t/011_avoid_copying_common_wals.pl | 120 +++++++++
5 files changed, 308 insertions(+), 103 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 5485033ed8c..4e60e6dbef8 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -52,12 +52,30 @@ PostgreSQL documentation
analogous to a base backup of the source data directory. Unlike taking
a new base backup or using a tool like <application>rsync</application>,
<application>pg_rewind</application> does not require comparing or copying
- unchanged relation blocks in the cluster. Only changed blocks from existing
- relation files are copied; all other files, including new relation files,
- configuration files, and WAL segments, are copied in full. As such the
- rewind operation is significantly faster than other approaches when the
- database is large and only a small fraction of blocks differ between the
- clusters.
+ unchanged relation blocks in the cluster.
+ </para>
+ <orderedlist>
+ <listitem>
+ <para>
+ Only changed blocks from existing relation files are copied
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ WAL segments prior to the point of divergence are skipped
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ All other files, including new relation files, configuration files, and WAL
+ segments after point of divergence are copied in full.
+ </para>
+ </listitem>
+ </orderedlist>
+ <para>
+ As such the rewind operation is significantly faster than other approaches when
+ the database is large and only a small fraction of blocks differ between the
+ clusters.
</para>
<para>
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c933871ca9f..27f89842d66 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -55,7 +55,6 @@
static filehash_hash *filehash;
-static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileLocator rlocator, ForkNumber forknum,
BlockNumber segno);
@@ -199,6 +198,103 @@ filehash_init(void)
filehash = filehash_create(FILEHASH_INITIAL_SIZE, NULL);
}
+static file_content_type_t
+getFileContentType(const char *path)
+{
+ RelFileLocator rlocator;
+ unsigned int segNo;
+ int nmatch;
+ bool matched;
+
+ /* Check if it's a WAL file */
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7; /* Skip "pg_wal/" */
+
+ pg_log_debug("file_name: %s", filename);
+ if (IsXLogFileName(filename))
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }
+
+ /*
+ * Does it look like a relation data file?
+ *
+ * For our purposes, only files belonging to the main fork are considered
+ * relation files. Other forks are always copied in toto, because we
+ * cannot reliably track changes to them, because WAL only contains block
+ * references for the main fork.
+ *
+ * ---- Relation data files can be in one of the following directories:
+ *
+ * global/ shared relations
+ *
+ * base/<db oid>/ regular relations, default tablespace
+ *
+ * pg_tblspc/<tblspc oid>/<tblspc version>/ within a non-default
+ * tablespace (the name of the directory depends on version)
+ *
+ * And the relation data files themselves have a filename like:
+ *
+ * <oid>.<segment number>
+ *
+ * ----
+ */
+ rlocator.spcOid = InvalidOid;
+ rlocator.dbOid = InvalidOid;
+ rlocator.relNumber = InvalidRelFileNumber;
+ segNo = 0;
+ matched = false;
+
+ nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
+ if (nmatch == 1 || nmatch == 2)
+ {
+ rlocator.spcOid = GLOBALTABLESPACE_OID;
+ rlocator.dbOid = 0;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "base/%u/%u.%u",
+ &rlocator.dbOid, &rlocator.relNumber, &segNo);
+ if (nmatch == 2 || nmatch == 3)
+ {
+ rlocator.spcOid = DEFAULTTABLESPACE_OID;
+ matched = true;
+ }
+ else
+ {
+ nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
+ &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
+ &segNo);
+ if (nmatch == 3 || nmatch == 4)
+ matched = true;
+ }
+ }
+
+ /*
+ * The sscanf tests above can match files that have extra characters at
+ * the end. To eliminate such cases, cross-check that GetRelationPath
+ * creates the exact same filename, when passed the RelFileLocator
+ * information we extracted from the filename.
+ */
+ if (matched)
+ {
+ char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
+
+ if (strcmp(check_path, path) != 0)
+ matched = false;
+
+ pfree(check_path);
+ }
+
+ if (matched)
+ return FILE_CONTENT_TYPE_RELATION;
+
+ return FILE_CONTENT_TYPE_OTHER;
+}
+
/* Look up entry for 'path', creating a new one if it doesn't exist */
static file_entry_t *
insert_filehash_entry(const char *path)
@@ -210,7 +306,7 @@ insert_filehash_entry(const char *path)
if (!found)
{
entry->path = pg_strdup(path);
- entry->isrelfile = isRelDataFile(path);
+ entry->content_type = getFileContentType(path);
entry->target_exists = false;
entry->target_type = FILE_TYPE_UNDEFINED;
@@ -294,7 +390,7 @@ process_source_file(const char *path, file_type_t type, size_t size,
* sanity check: a filename that looks like a data file better be a
* regular file
*/
- if (type != FILE_TYPE_REGULAR && isRelDataFile(path))
+ if (type != FILE_TYPE_REGULAR && getFileContentType(path) == FILE_CONTENT_TYPE_RELATION)
pg_fatal("data file \"%s\" in source is not a regular file", path);
/* Remember this source file */
@@ -383,7 +479,7 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator,
*/
if (entry)
{
- Assert(entry->isrelfile);
+ Assert(entry->content_type == FILE_CONTENT_TYPE_RELATION);
if (entry->target_exists)
{
@@ -559,92 +655,6 @@ print_filemap(filemap_t *filemap)
fflush(stdout);
}
-/*
- * Does it look like a relation data file?
- *
- * For our purposes, only files belonging to the main fork are considered
- * relation files. Other forks are always copied in toto, because we cannot
- * reliably track changes to them, because WAL only contains block references
- * for the main fork.
- */
-static bool
-isRelDataFile(const char *path)
-{
- RelFileLocator rlocator;
- unsigned int segNo;
- int nmatch;
- bool matched;
-
- /*----
- * Relation data files can be in one of the following directories:
- *
- * global/
- * shared relations
- *
- * base/<db oid>/
- * regular relations, default tablespace
- *
- * pg_tblspc/<tblspc oid>/<tblspc version>/
- * within a non-default tablespace (the name of the directory
- * depends on version)
- *
- * And the relation data files themselves have a filename like:
- *
- * <oid>.<segment number>
- *
- *----
- */
- rlocator.spcOid = InvalidOid;
- rlocator.dbOid = InvalidOid;
- rlocator.relNumber = InvalidRelFileNumber;
- segNo = 0;
- matched = false;
-
- nmatch = sscanf(path, "global/%u.%u", &rlocator.relNumber, &segNo);
- if (nmatch == 1 || nmatch == 2)
- {
- rlocator.spcOid = GLOBALTABLESPACE_OID;
- rlocator.dbOid = 0;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "base/%u/%u.%u",
- &rlocator.dbOid, &rlocator.relNumber, &segNo);
- if (nmatch == 2 || nmatch == 3)
- {
- rlocator.spcOid = DEFAULTTABLESPACE_OID;
- matched = true;
- }
- else
- {
- nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u",
- &rlocator.spcOid, &rlocator.dbOid, &rlocator.relNumber,
- &segNo);
- if (nmatch == 3 || nmatch == 4)
- matched = true;
- }
- }
-
- /*
- * The sscanf tests above can match files that have extra characters at
- * the end. To eliminate such cases, cross-check that GetRelationPath
- * creates the exact same filename, when passed the RelFileLocator
- * information we extracted from the filename.
- */
- if (matched)
- {
- char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);
-
- if (strcmp(check_path, path) != 0)
- matched = false;
-
- pfree(check_path);
- }
-
- return matched;
-}
-
/*
* A helper function to create the path of a relation file and segment.
*
@@ -693,11 +703,44 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ * Caller is responsible for ensuring the file exists on both
+ * source and target.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target which means they
+ * should already be identical and before the last_common_segno.
+ *
+ * However we check last_common_segno and file_size again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
+ {
+ pg_log_debug("WAL segment \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -799,7 +842,19 @@ decide_file_action(file_entry_t *entry)
return FILE_ACTION_NONE;
case FILE_TYPE_REGULAR:
- if (!entry->isrelfile)
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL)
+ {
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno, entry->source_size, entry->target_size);
+ }
+ else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
/*
* It's a non-data file that we have no special processing
@@ -858,7 +913,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -868,7 +923,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index df78a02e3da..982de52176b 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -36,6 +37,13 @@ typedef enum
FILE_TYPE_SYMLINK,
} file_type_t;
+typedef enum
+{
+ FILE_CONTENT_TYPE_OTHER = 0,
+ FILE_CONTENT_TYPE_RELATION,
+ FILE_CONTENT_TYPE_WAL
+} file_content_type_t;
+
/*
* For every file found in the local or remote system, we have a file entry
* that contains information about the file on both systems. For relation
@@ -51,7 +59,7 @@ typedef struct file_entry_t
uint32 status; /* hash status */
const char *path;
- bool isrelfile; /* is it a relation data file? */
+ file_content_type_t content_type;
/*
* Status of the file in the target.
@@ -106,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..318d0a27184 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,9 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /* Convert divergence LSN to segment number */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +496,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
new file mode 100644
index 00000000000..a39972e689d
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_avoid_copying_common_wals.pl
@@ -0,0 +1,120 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+#
+# Test situation where we skip copying the same WAL files from source
+# to target unless file size differs.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+use Time::HiRes qw(usleep);
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+
+RewindTest::create_standby();
+
+
+# Advance WAL on primary
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Common segment to be skipped copying over as it's common and same size
+my $wal_seg_skipped = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# Segment before divergerec that will be copied over because file size differs
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+my $corrupt_wal_seg = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# last common checkpoint
+RewindTest::primary_psql("CHECKPOINT");
+
+RewindTest::promote_standby;
+
+# New Timeline Segment after divergerec, expected to be copied
+my $new_timeline_wal_seg = $node_standby->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get stat info for the WAL file that should be skipped
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Store modification time for later comparison
+my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
+
+# Corrupt a segment on target before divergerec to test it's copied over from source,
+# by appending an additional byte to it.
+my $corrupt_wal_seg_in_target_path = $node_primary->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+open my $fh, ">>", $corrupt_wal_seg_in_target_path
+ or die "could not open $corrupt_wal_seg_in_target_path";
+
+print $fh 'a';
+close $fh;
+
+my $corrupt_wal_seg_stat_before_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_before_rewind), "WAL segment $corrupt_wal_seg should exist in target before rewind");
+
+# Verify that the WAL segment on new timeline does not exist in target before rewind
+my $new_timeline_wal_seg_path = $node_primary->data_dir . '/pg_wal/' . $new_timeline_wal_seg;
+my $new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(!defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should not exist in target before rewind");
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Sleep to allow mtime to be different
+usleep(1000000);
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//
+ ],
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+ ],
+ 'run pg_rewind'
+);
+
+# Verify that the copied WAL segment now exists in target
+$new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(defined($new_timeline_wal_seg_stat), "WAL segment $new_timeline_wal_seg should exist in target after rewind");
+
+# Get current modification time of the skipped WAL segment
+my $wal_skipped_stat_after_rewind = stat($wal_skipped_path);
+defined($wal_skipped_stat_after_rewind) or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after_rewind->mtime;
+
+# Validate that modification time hasn't changed,
+is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
+ "WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)");
+
+# Validate that the WAL segment with same filename as the
+# corrupt WAL segment in target has been copied from source
+# where we have it intact.
+my $corrupt_wal_seg_in_source_path = $node_standby->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+my $corrupt_wal_seg_source_stat = stat($corrupt_wal_seg_in_source_path);
+ok(defined($corrupt_wal_seg_source_stat), "WAL segment $corrupt_wal_seg should exist in source after rewind");
+my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_after_rewind), "WAL segment $corrupt_wal_seg should exist in target after rewind");
+ok($corrupt_wal_seg_stat_before_rewind->size != $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg to have different size in source vs target before rewind");
+ok($corrupt_wal_seg_stat_after_rewind->mtime > $corrupt_wal_seg_stat_before_rewind->mtime,
+ "Expected WAL segment $corrupt_wal_seg to have later mtime on target than source after rewind as it was copied");
+ok($corrupt_wal_seg_stat_after_rewind->size == $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same between target and source after rewind as it was copied");
+
+done_testing();
--
2.50.1
On Thu, Sep 25, 2025 at 02:35:16PM -0400, Robert Haas wrote:
On a related note, something a committer would need to do if they
wanted to commit this patch is figure out who should be credited with
writing it. I see that a number of different people have posted
versions of the patch; it would be nice if "Author:" or
"Co-authored-by:" tags were added to the commit message to reflect
whose code is present in a given version.
Looking at this patch was on my TODO list (more to that in a bit), and
the simplest answer to this question is that everybody involved would
be cited in the release notes. Justin Kwan has been the original
author. John Hsu and Srinath have followed up in an equivalent
fashion, including answers with your review comments posted on this
previous message. So listing all three as equivalent authors feels
normal to me (no need for co-authoring notes, IMO).
--
Michael
On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
Made the changes and included the documentation update.
I have been looking at this patch, and the first part that stood out
is that the refactoring related to isRelDataFile() can be done as an
independent piece, cutting the total size of the main patch by 30% or
so. So I have extracted this part, simplified it to make it more
consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
I am still looking at the rest, and attached is a rebased version of
what I have for the moment. Note first that the new test was missing
in meson.build, so the coverage provided by the CI was limited.
+ * However we check last_common_segno and file_size again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
What if a segment has a size different than the one a cluster has been
initialized with, still both have the same size on the target and the
source? Unlikely, still incorrect if not cross-checked with what a
control file has in store. :)
+# Sleep to allow mtime to be different
+usleep(1000000);
[...]
+ [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+ ],
FWIW, I am definitely not a fan of the test relying on the timestamp
of the files based on their retrieved meta-data stats, with the mtime.
I suspect complications on Windows. Worse thing, there is a forced
sleep of 1s added to the test, to make sure that the timestamp of the
files we compare have a different number. But do we really need that
anyway if we have the debug information with the file map printed in
it?
Hence, I think that we should simplify and rework the test a bit,
tweaking a bit how the filemap is printed: now that we can detect if
an operation is happening on a WAL file thanks to file_content_type_t,
let's always print the type of operation done for each WAL file and
remove this "not copied to target" debug log which provides the same
information, then let's compare the debug output with what we want.
It seems to me that it would be good for the test to run some sanity
checks on the rewound standby, as well. That would provide more
validation for the case of the "corrupted" segment that gets copied
anyway.
--
Michael
Attachments:
v12-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchtext/x-diff; charset=us-asciiDownload
From 93970d6e62812738778668fe5551198e8749dc7f Mon Sep 17 00:00:00 2001
From: John Hsu <johnyvr@gmail.com>
Date: Wed, 8 Oct 2025 21:20:59 +0000
Subject: [PATCH v12] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
src/bin/pg_rewind/filemap.c | 53 ++++++++++-
src/bin/pg_rewind/filemap.h | 3 +-
src/bin/pg_rewind/meson.build | 1 +
src/bin/pg_rewind/pg_rewind.c | 9 +-
src/bin/pg_rewind/t/011_wal_copy.pl | 141 ++++++++++++++++++++++++++++
doc/src/sgml/ref/pg_rewind.sgml | 32 +++++--
6 files changed, 226 insertions(+), 13 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_wal_copy.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 00f5d60d6209..2d82edec0608 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -706,11 +706,45 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ *
+ * Caller is responsible for ensuring that the file exists on both
+ * source and target servers.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files are assumed to exist on source and target which means they
+ * should already be identical and before the last_common_segno.
+ *
+ * However we check last_common_segno and file_size again for sanity.
+ */
+ if (file_segno < last_common_segno && source_size == target_size)
+ {
+ pg_log_debug("WAL segment \"%s\" not copied to target", fname);
+ return FILE_ACTION_NONE;
+ }
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -814,8 +848,17 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (entry->content_type == FILE_CONTENT_TYPE_WAL)
{
- /* It's a WAL file, copy it. */
- return FILE_ACTION_COPY;
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno,
+ entry->source_size,
+ entry->target_size);
}
else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
@@ -876,7 +919,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -886,7 +929,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index fada420fc230..5145f0b4c46c 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -113,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/meson.build b/src/bin/pg_rewind/meson.build
index 36171600ccaf..97f001d94a50 100644
--- a/src/bin/pg_rewind/meson.build
+++ b/src/bin/pg_rewind/meson.build
@@ -44,6 +44,7 @@ tests += {
't/008_min_recovery_point.pl',
't/009_growing_files.pl',
't/010_keep_recycled_wals.pl',
+ 't/011_wal_copy.pl',
],
},
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e6..1b953692b176 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,12 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /*
+ * Convert the divergence LSN to a segment number, that will be used
+ * to decide how WAL segments should be processed.
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +499,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/011_wal_copy.pl
new file mode 100644
index 000000000000..b9e24844654d
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# Check how the copy of WAL segments is handled from the source to
+# the target server.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+use Time::HiRes qw(usleep);
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+RewindTest::create_standby();
+
+# Advance WAL on primary
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Segment that is not copied from the source to the target, being
+# generated before the servers have diverged.
+my $wal_seg_skipped = $node_primary->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# Follow-up segment, that will include corrupted contents, and should be
+# copied from the source to the target even if generated before the point
+# of divergence.
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+my $corrupt_wal_seg = $node_primary->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+RewindTest::primary_psql("CHECKPOINT");
+RewindTest::promote_standby;
+
+# New segment on a new timeline, expected to be copied.
+my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get some stats info for the WAL files whose copies should be skipped.
+my $wal_skipped_path =
+ $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Store modification time for later comparison
+my $wal_seg_skipped_mod_time = $wal_skipped_stat->mtime;
+
+# Corrupt a WAL segment on target that has been generated before the
+# divergence point. We will check that it is copied over from source.
+my $corrupt_wal_seg_in_target_path =
+ $node_primary->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+open my $fh, ">>", $corrupt_wal_seg_in_target_path
+ or die "could not open $corrupt_wal_seg_in_target_path";
+
+print $fh 'a';
+close $fh;
+
+my $corrupt_wal_seg_stat_before_rewind =
+ stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_before_rewind),
+ "WAL segment $corrupt_wal_seg should exist in target before rewind");
+
+# Verify that the WAL segment on the new timeline does not exist in target
+# before the rewind.
+my $new_timeline_wal_seg_path =
+ $node_primary->data_dir . '/pg_wal/' . $new_timeline_wal_seg;
+my $new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(!defined($new_timeline_wal_seg_stat),
+ "WAL segment $new_timeline_wal_seg should not exist in target before rewind"
+);
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Sleep to allow mtime to be different
+usleep(1000000);
+
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//],
+ [
+ qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+ ],
+ 'run pg_rewind');
+
+# Verify that the copied WAL segment now exists in target.
+$new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(defined($new_timeline_wal_seg_stat),
+ "WAL segment $new_timeline_wal_seg should exist in target after rewind");
+
+# Get current modification time of the skipped WAL segment.
+my $wal_skipped_stat_after_rewind = stat($wal_skipped_path);
+defined($wal_skipped_stat_after_rewind)
+ or die("unable to stat $wal_skipped_path after rewind");
+my $wal_seg_latest_skipped_mod_time = $wal_skipped_stat_after_rewind->mtime;
+
+# Validate that modification time hasn't changed.
+is($wal_seg_latest_skipped_mod_time, $wal_seg_skipped_mod_time,
+ "WAL segment $wal_seg_skipped modification time should be unchanged (not overwritten)"
+);
+
+# Validate that the WAL segment with the same file name as the
+# corrupted WAL segment in target has been copied from source
+# where we have it intact.
+my $corrupt_wal_seg_in_source_path =
+ $node_standby->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+my $corrupt_wal_seg_source_stat = stat($corrupt_wal_seg_in_source_path);
+ok(defined($corrupt_wal_seg_source_stat),
+ "WAL segment $corrupt_wal_seg should exist in source after rewind");
+my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_after_rewind),
+ "WAL segment $corrupt_wal_seg should exist in target after rewind");
+ok( $corrupt_wal_seg_stat_before_rewind->size !=
+ $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg to have different size in source vs target before rewind"
+);
+ok( $corrupt_wal_seg_stat_after_rewind->mtime >
+ $corrupt_wal_seg_stat_before_rewind->mtime,
+ "Expected WAL segment $corrupt_wal_seg to have later mtime on target than source after rewind as it was copied"
+);
+ok( $corrupt_wal_seg_stat_after_rewind->size ==
+ $corrupt_wal_seg_source_stat->size,
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same between target and source after rewind as it was copied"
+);
+
+done_testing();
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 5485033ed8c7..440b2c79f5f0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -52,12 +52,32 @@ PostgreSQL documentation
analogous to a base backup of the source data directory. Unlike taking
a new base backup or using a tool like <application>rsync</application>,
<application>pg_rewind</application> does not require comparing or copying
- unchanged relation blocks in the cluster. Only changed blocks from existing
- relation files are copied; all other files, including new relation files,
- configuration files, and WAL segments, are copied in full. As such the
- rewind operation is significantly faster than other approaches when the
- database is large and only a small fraction of blocks differ between the
- clusters.
+ unchanged relation blocks in the cluster.
+ </para>
+ <orderedlist>
+ <listitem>
+ <para>
+ Only changed blocks from existing relation files are copied.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ WAL segments prior to the point where the source and target servers
+ have diverged are not copied. WAL segments generated after the source
+ and target servers have diverged are copied in full.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ All other files, including new relation files and configuration files,
+ are copied in full.
+ </para>
+ </listitem>
+ </orderedlist>
+ <para>
+ As such, the rewind operation is significantly faster than other
+ approaches when the database is large and only a small fraction of blocks
+ differ between the clusters.
</para>
<para>
--
2.51.0
On Thu, Oct 23, 2025 at 4:22 AM Michael Paquier <michael@paquier.xyz> wrote:
+ * However we check last_common_segno and file_size again for sanity. + */ + if (file_segno < last_common_segno && source_size == target_size)What if a segment has a size different than the one a cluster has been
initialized with, still both have the same size on the target and the
source? Unlikely, still incorrect if not cross-checked with what a
control file has in store. :)
While I'm not against cross-checking against the control file, this
sounds like an imaginary scenario to me. That is, it would only happen
if somebody maliciously modified the contents of the data directory by
hand with the express goal of breaking the tool. But we fundamentally
cannot defend against a malicious user whose express goal is to break
the tool, and I do not see any compelling reason to expend energy on
it even in cases like this where we could theoretically detect it
without much effort. If we go down that path, we'll end up not only
complicating the code, but also obscuring our own goals: it will look
like we've either done too much sanity checking (because we will have
added checks that are unnecessary with a non-malicious user) or too
little (because we will not have caught all the things a malicious
user might do).
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Michael,
On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
Made the changes and included the documentation update.
I have been looking at this patch, and the first part that stood out
is that the refactoring related to isRelDataFile() can be done as an
independent piece, cutting the total size of the main patch by 30% or
so. So I have extracted this part, simplified it to make it more
consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
yeah ... that's cleaner, thanks for committing.
+# Sleep to allow mtime to be different +usleep(1000000); [...] + [qr/WAL segment \"$wal_seg_skipped\" not copied to target/, + qr/pg_wal\/$corrupt_wal_seg \(COPY\)/ + ],FWIW, I am definitely not a fan of the test relying on the timestamp
of the files based on their retrieved meta-data stats, with the mtime.
I suspect complications on Windows. Worse thing, there is a forced
sleep of 1s added to the test, to make sure that the timestamp of the
files we compare have a different number. But do we really need that
anyway if we have the debug information with the file map printed in
it?
thought it would act like an extra sanity check ,to prove the point that the
pg_rewind is saying through the debug info that it has been really copied
or skipped.
Hence, I think that we should simplify and rework the test a bit,
tweaking a bit how the filemap is printed: now that we can detect if
an operation is happening on a WAL file thanks to file_content_type_t,
let's always print the type of operation done for each WAL file and
remove this "not copied to target" debug log which provides the same
information, then let's compare the debug output with what we want.
i guess this is what you want ,please let me know if it's otherwise,
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2d82edec060..2180495d337 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -546,7 +546,7 @@ print_filemap(filemap_t *filemap)
for (i = 0; i < filemap->nentries; i++)
{
entry = filemap->entries[i];
- if (entry->action != FILE_ACTION_NONE ||
+ if (entry->content_type == FILE_CONTENT_TYPE_WAL ||
entry->action != FILE_ACTION_NONE ||
entry->target_pages_to_overwrite.bitmapsize > 0)
{
pg_log_debug("%s (%s)", entry->path,
@@ -733,7 +733,6 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
*/
if (file_segno < last_common_segno && source_size == target_size)
{
- pg_log_debug("WAL segment \"%s\" not copied to target",
fname);
return FILE_ACTION_NONE;
}
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
011_wal_copy.pl
index b9e24844654..3e12744dfa6 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -93,7 +93,7 @@ command_checks_all(
0,
[qr//],
[
- qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+ qr/pg_wal\/$wal_seg_skipped \(NONE\)/,
qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
],
'run pg_rewind');
It seems to me that it would be good for the test to run some sanity
checks on the rewound standby, as well. That would provide more
validation for the case of the "corrupted" segment that gets copied
anyway.
i think these checks help do the same
1) qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
this shows the corrupted segment is copied.
2) ok( $corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg to have different size in
source vs target before rewind"
);
my $corrupt_wal_seg_stat_after_rewind =
stat($corrupt_wal_seg_in_target_path);
ok( $corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg file sizes to be same between
target and source after rewind as it was copied"
);
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
rewind it was the same as source,please let me know if i am
missing your point here.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
Hi,
On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
FWIW, I am definitely not a fan of the test relying on the timestamp
of the files based on their retrieved meta-data stats, with the mtime.
I suspect complications on Windows. Worse thing, there is a forced
sleep of 1s added to the test, to make sure that the timestamp of the
files we compare have a different number. But do we really need that
anyway if we have the debug information with the file map printed in
it?thought it would act like an extra sanity check ,to prove the point that the
pg_rewind is saying through the debug info that it has been really copied
or skipped.
I don't feel too strongly about it. If we want to rely on checking debug log
that works for me. We don't do any in-depth checks that every file was
properly copied anyway for existing files.
...
i guess this is what you want ,please let me know if it's otherwise,
...
That looks right to me.
It seems to me that it would be good for the test to run some sanity
checks on the rewound standby, as well. That would provide more
validation for the case of the "corrupted" segment that gets copied
anyway....
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
rewind it was the same as source,please let me know if i am
missing your point here.
What did you have in mind for additional sanity checks?
The existing test checks that when sizes are different the correct
branch is taken. For something more in-depth that requires comparing
data before and after the rewind with a "corrupted" segment that seems
complicated since the segment would have to somehow be applied to
writer but not replica prior to divergence.
--
John Hsu - Amazon Web Services
On Thu, Oct 23, 2025 at 03:01:32PM -0700, John H wrote:
On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote:
...
i guess this is what you want ,please let me know if it's otherwise,
...That looks right to me.
Yes, that would be something among these lines, WAL segments being
treated as an exception when printing the file map to show that they
are not copied.
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
rewind it was the same as source,please let me know if i am
missing your point here.What did you have in mind for additional sanity checks?
The existing test checks that when sizes are different the correct
branch is taken. For something more in-depth that requires comparing
data before and after the rewind with a "corrupted" segment that seems
complicated since the segment would have to somehow be applied to
writer but not replica prior to divergence.
I was just wondering about some queries on the new standby once we are
done with the rewind. But after sleeping on it, I'd be happy with
just the set of tests we have: the debug output checks, the size
checks pre and post-rewind, and no mtime.
--
Michael
On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
I was just wondering about some queries on the new standby once we are
done with the rewind. But after sleeping on it, I'd be happy with
just the set of tests we have: the debug output checks, the size
checks pre and post-rewind, and no mtime.
So, I am coming down to the attached (minus the commit message) after
more edits and adjustments. It's Friday evening here, and I am not
planning to do more today and to take bets on the buildfarm, even if
the CI is OK.
--
Michael
Attachments:
v13-0001-Avoid-copying-WAL-segments-before-divergence-to-.patchtext/x-diff; charset=us-asciiDownload
From 2697ce6f8f2267a844e80e5ce7c253489b1f450d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 24 Oct 2025 17:35:31 +0900
Subject: [PATCH v13] Avoid copying WAL segments before divergence to speed up
pg_rewind
Adds a check to avoid copying any WAL segment files from source
to target if they are common between both servers before the
point of WAL divergence during pg_rewind.
All WAL files that exist on source and target, which fall
before the segment of the first diverged LSN can safely be
skipped from copying to the target as they have been replicated
from the original primary.
Author: John Hsu <johnhyvr@gmail.com>
Co-Author: Justin Kwan <justinpkwan@outlook.com>
---
src/bin/pg_rewind/filemap.c | 55 ++++++++++--
src/bin/pg_rewind/filemap.h | 3 +-
src/bin/pg_rewind/meson.build | 1 +
src/bin/pg_rewind/pg_rewind.c | 9 +-
src/bin/pg_rewind/t/011_wal_copy.pl | 128 ++++++++++++++++++++++++++++
doc/src/sgml/ref/pg_rewind.sgml | 32 +++++--
6 files changed, 215 insertions(+), 13 deletions(-)
create mode 100644 src/bin/pg_rewind/t/011_wal_copy.pl
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 00f5d60d6209..59672e669323 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -546,7 +546,9 @@ print_filemap(filemap_t *filemap)
for (i = 0; i < filemap->nentries; i++)
{
entry = filemap->entries[i];
+
if (entry->action != FILE_ACTION_NONE ||
+ entry->content_type == FILE_CONTENT_TYPE_WAL ||
entry->target_pages_to_overwrite.bitmapsize > 0)
{
pg_log_debug("%s (%s)", entry->path,
@@ -706,11 +708,45 @@ final_filemap_cmp(const void *a, const void *b)
return strcmp(fa->path, fb->path);
}
+/*
+ * Decide what to do with a WAL segment file based on its position
+ * relative to the point of divergence.
+ *
+ * Caller is responsible for ensuring that the file exists on both
+ * source and target servers.
+ */
+static file_action_t
+decide_wal_file_action(const char *fname, XLogSegNo last_common_segno,
+ size_t source_size, size_t target_size)
+{
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Get current WAL segment number given current segment file name */
+ XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);
+
+ /*
+ * Avoid copying files before the last common segment.
+ *
+ * These files exist on the source and the target services, so they should
+ * be identical and located strictly before the segment that contains the
+ * LSN where target and source servers have diverged.
+ *
+ * While we are on it, double-check the size of each file and copy the
+ * file if they do not match, in case.
+ */
+ if (file_segno < last_common_segno &&
+ source_size == target_size)
+ return FILE_ACTION_NONE;
+
+ return FILE_ACTION_COPY;
+}
+
/*
* Decide what action to perform to a file.
*/
static file_action_t
-decide_file_action(file_entry_t *entry)
+decide_file_action(file_entry_t *entry, XLogSegNo last_common_segno)
{
const char *path = entry->path;
@@ -814,8 +850,17 @@ decide_file_action(file_entry_t *entry)
case FILE_TYPE_REGULAR:
if (entry->content_type == FILE_CONTENT_TYPE_WAL)
{
- /* It's a WAL file, copy it. */
- return FILE_ACTION_COPY;
+ /* Handle WAL segment file */
+ const char *filename = last_dir_separator(entry->path);
+
+ if (filename == NULL)
+ filename = entry->path;
+ else
+ filename++; /* Skip the separator */
+
+ return decide_wal_file_action(filename, last_common_segno,
+ entry->source_size,
+ entry->target_size);
}
else if (entry->content_type != FILE_CONTENT_TYPE_RELATION)
{
@@ -876,7 +921,7 @@ decide_file_action(file_entry_t *entry)
* should be executed.
*/
filemap_t *
-decide_file_actions(void)
+decide_file_actions(XLogSegNo last_common_segno)
{
int i;
filehash_iterator it;
@@ -886,7 +931,7 @@ decide_file_actions(void)
filehash_start_iterate(filehash, &it);
while ((entry = filehash_iterate(filehash, &it)) != NULL)
{
- entry->action = decide_file_action(entry);
+ entry->action = decide_file_action(entry, last_common_segno);
}
/*
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index fada420fc230..5145f0b4c46c 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -11,6 +11,7 @@
#include "datapagemap.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "access/xlogdefs.h"
/* these enum values are sorted in the order we want actions to be processed */
typedef enum
@@ -113,7 +114,7 @@ extern void process_target_wal_block_change(ForkNumber forknum,
RelFileLocator rlocator,
BlockNumber blkno);
-extern filemap_t *decide_file_actions(void);
+extern filemap_t *decide_file_actions(XLogSegNo last_common_segno);
extern void calculate_totals(filemap_t *filemap);
extern void print_filemap(filemap_t *filemap);
diff --git a/src/bin/pg_rewind/meson.build b/src/bin/pg_rewind/meson.build
index 36171600ccaf..97f001d94a50 100644
--- a/src/bin/pg_rewind/meson.build
+++ b/src/bin/pg_rewind/meson.build
@@ -44,6 +44,7 @@ tests += {
't/008_min_recovery_point.pl',
't/009_growing_files.pl',
't/010_keep_recycled_wals.pl',
+ 't/011_wal_copy.pl',
],
},
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e6..1b953692b176 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -147,6 +147,7 @@ main(int argc, char **argv)
TimeLineID source_tli;
TimeLineID target_tli;
XLogRecPtr target_wal_endrec;
+ XLogSegNo last_common_segno;
size_t size;
char *buffer;
bool no_ensure_shutdown = false;
@@ -397,6 +398,12 @@ main(int argc, char **argv)
LSN_FORMAT_ARGS(divergerec),
targetHistory[lastcommontliIndex].tli);
+ /*
+ * Convert the divergence LSN to a segment number, that will be used
+ * to decide how WAL segments should be processed.
+ */
+ XLByteToSeg(divergerec, last_common_segno, ControlFile_target.xlog_seg_size);
+
/*
* Don't need the source history anymore. The target history is still
* needed by the routines in parsexlog.c, when we read the target WAL.
@@ -492,7 +499,7 @@ main(int argc, char **argv)
* We have collected all information we need from both systems. Decide
* what to do with each file.
*/
- filemap = decide_file_actions();
+ filemap = decide_file_actions(last_common_segno);
if (showprogress)
calculate_totals(filemap);
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/011_wal_copy.pl
new file mode 100644
index 000000000000..fb5b7104378e
--- /dev/null
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -0,0 +1,128 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# Check how the copy of WAL segments is handled from the source to
+# the target server.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::stat qw(stat);
+
+use FindBin;
+use lib $FindBin::RealBin;
+use RewindTest;
+
+RewindTest::setup_cluster();
+RewindTest::start_primary();
+RewindTest::create_standby();
+
+# Advance WAL on primary
+RewindTest::primary_psql("CREATE TABLE t(a int)");
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+
+# Segment that is not copied from the source to the target, being
+# generated before the servers have diverged.
+my $wal_seg_skipped = $node_primary->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+# Follow-up segment, that will include corrupted contents, and will be
+# copied from the source to the target even if generated before the point
+# of divergence.
+RewindTest::primary_psql("INSERT INTO t VALUES(0)");
+my $corrupt_wal_seg = $node_primary->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+RewindTest::primary_psql("SELECT pg_switch_wal()");
+
+RewindTest::primary_psql("CHECKPOINT");
+RewindTest::promote_standby;
+
+# New segment on a new timeline, expected to be copied.
+my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
+ 'SELECT pg_walfile_name(pg_current_wal_lsn())');
+
+# Get some stats info for the WAL file whose copy is skipped.
+my $wal_skipped_path =
+ $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
+my $wal_skipped_stat = stat($wal_skipped_path);
+defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
+
+# Corrupt a WAL segment on target that has been generated before the
+# divergence point. We will check that it is copied from the source.
+my $corrupt_wal_seg_in_target_path =
+ $node_primary->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+open my $fh, ">>", $corrupt_wal_seg_in_target_path
+ or die "could not open $corrupt_wal_seg_in_target_path";
+
+print $fh 'a';
+close $fh;
+
+my $corrupt_wal_seg_stat_before_rewind =
+ stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_before_rewind),
+ "segment $corrupt_wal_seg exists in target before rewind");
+
+# Verify that the WAL segment on the new timeline does not exist in target
+# before the rewind.
+my $new_timeline_wal_seg_path =
+ $node_primary->data_dir . '/pg_wal/' . $new_timeline_wal_seg;
+my $new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(!defined($new_timeline_wal_seg_stat),
+ "segment $new_timeline_wal_seg does not exist in target before rewind"
+);
+
+$node_standby->stop();
+$node_primary->stop();
+
+# Cross-check how WAL segments are handled:
+# - The "corrupted" segment generated before the point of divergence is
+# copied.
+# - The "clean" segment generated before the point of divergence is skipped.
+# - The segment of the new timeline is copied.
+command_checks_all(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata' => $node_standby->data_dir,
+ '--target-pgdata' => $node_primary->data_dir,
+ '--no-sync',
+ ],
+ 0,
+ [qr//],
+ [
+ qr/pg_wal\/$wal_seg_skipped \(NONE\)/,
+ qr/pg_wal\/$corrupt_wal_seg \(COPY\)/,
+ qr/pg_wal\/$new_timeline_wal_seg \(COPY\)/,
+ ],
+ 'run pg_rewind');
+
+# Verify that the first WAL segment of the new timeline now exists in
+# target.
+$new_timeline_wal_seg_stat = stat($new_timeline_wal_seg_path);
+ok(defined($new_timeline_wal_seg_stat),
+ "new timeline segment $new_timeline_wal_seg exists in target after rewind"
+);
+
+# Validate that the WAL segment with the same file name as the
+# corrupted WAL segment in target has been copied from source
+# where it was still intact.
+my $corrupt_wal_seg_in_source_path =
+ $node_standby->data_dir . '/pg_wal/' . $corrupt_wal_seg;
+my $corrupt_wal_seg_source_stat = stat($corrupt_wal_seg_in_source_path);
+ok(defined($corrupt_wal_seg_source_stat),
+ "corrupted $corrupt_wal_seg exists in source after rewind");
+
+my $corrupt_wal_seg_stat_after_rewind = stat($corrupt_wal_seg_in_target_path);
+ok(defined($corrupt_wal_seg_stat_after_rewind),
+ "corrupted $corrupt_wal_seg exists in target after rewind");
+isnt( $corrupt_wal_seg_stat_before_rewind->size,
+ $corrupt_wal_seg_source_stat->size,
+ "different size of corrupted $corrupt_wal_seg in source vs target before rewind"
+);
+is( $corrupt_wal_seg_stat_after_rewind->size,
+ $corrupt_wal_seg_source_stat->size,
+ "same size of corrupted $corrupt_wal_seg in source and target after rewind"
+);
+
+done_testing();
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 5485033ed8c7..928e78cda33a 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -52,12 +52,32 @@ PostgreSQL documentation
analogous to a base backup of the source data directory. Unlike taking
a new base backup or using a tool like <application>rsync</application>,
<application>pg_rewind</application> does not require comparing or copying
- unchanged relation blocks in the cluster. Only changed blocks from existing
- relation files are copied; all other files, including new relation files,
- configuration files, and WAL segments, are copied in full. As such the
- rewind operation is significantly faster than other approaches when the
- database is large and only a small fraction of blocks differ between the
- clusters.
+ unchanged relation blocks in the cluster:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Only changed blocks from existing relation files are copied.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ WAL segments prior to the point where the source and target servers
+ have diverged are not copied. WAL segments generated after the source
+ and target servers have diverged are copied in full.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ All other files, including new relation files and configuration files,
+ are copied in full.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ As such, the rewind operation is significantly faster than other
+ approaches when the database is large and only a small fraction of blocks
+ differ between the clusters.
</para>
<para>
--
2.51.0
On Fri, Oct 24, 2025 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
I was just wondering about some queries on the new standby once we are
done with the rewind. But after sleeping on it, I'd be happy with
just the set of tests we have: the debug output checks, the size
checks pre and post-rewind, and no mtime.So, I am coming down to the attached (minus the commit message) after
more edits and adjustments. It's Friday evening here, and I am not
planning to do more today and to take bets on the buildfarm, even if
the CI is OK.
Thanks for the updated patch, I have reviewed and tested the
patch, except these minor refactors ,it LGTM.
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 59672e66932..467fd97ebcf 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -728,7 +728,7 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
/*
* Avoid copying files before the last common segment.
*
- * These files exist on the source and the target services, so they
should
+ * These files exist on the source and the target servers, so they
should
* be identical and located strictly before the segment that
contains the
* LSN where target and source servers have diverged.
*
I think as we are not using mtime to show that the file has not been copied
and been skipped ,instead we are doing the same with the debug message
(qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
segment can be removed.
diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
011_wal_copy.pl
index fb5b7104378..c4d7200d710 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -43,12 +43,6 @@ RewindTest::promote_standby;
my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
'SELECT pg_walfile_name(pg_current_wal_lsn())');
-# Get some stats info for the WAL file whose copy is skipped.
-my $wal_skipped_path =
- $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
-my $wal_skipped_stat = stat($wal_skipped_path);
-defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
-
# Corrupt a WAL segment on target that has been generated before the
# divergence point. We will check that it is copied from the source.
*Test:*
I tried to demonstrate that with this patch, we are not copying WAL
segments before the divergence point by using wal_keep_size.
*Environment:*
A primary and standby physical streaming replication setup was used.
The TPC-H dbgen tool with a Scale Factor of 10 was used to generate
data, which in turn generated sufficient WAL to test with wal_keep_size =
20000 MB.
A failover and rewind were performed.
*Result:*
*Without Patch:*
SELECT
pg_size_pretty(sum(size)) AS wal_size
FROM
pg_ls_waldir();
wal_size
----------
19 GB
(1 row)
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/DEACBAB0 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/C8250D28 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
*pg_rewind: need to copy 20449 MB (total source directory size is 32778 MB)*
20940602/20940602 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
To show that only a small amount of WAL was generated between the last
common
checkpoint and the latest checkpoint on the source:
postgres=# select pg_size_pretty('4/DEACBB78'::pg_lsn -
'4/C8250D28'::pg_lsn);
pg_size_pretty
----------------
360 MB
(1 row)
so the 20.449 GB copied was primarily due to wal_keep_size
(19 GB of accumulated WAL) being copied, as previously non-data files were
copied in full.
*With Patch:*
SELECT
pg_size_pretty(sum(size)) AS wal_size
FROM
pg_ls_waldir();
wal_size
----------
19 GB
(1 row)
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/B0F2FE78 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/B087E468 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
*pg_rewind: need to copy 212 MB (total source directory size is 32074 MB)*
*pg_rewind: skipped WAL 19168 MB* -> tweaked calculate_totals to get this,
just to give more info (not included in patch).
217467/217467 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
postgres=# select pg_size_pretty('4/B0F30AB0'::pg_lsn -
'4/B087E468'::pg_lsn);
pg_size_pretty
----------------
6855 kB
(1 row)
This test shows that with the patch applied, only the necessary WAL segments
after the divergence point are copied, while previously accumulated WAL
segments (due to wal_keep_size) are correctly skipped.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
- * These files exist on the source and the target services, so they should + * These files exist on the source and the target servers, so they should
Not sure what my fingers were doing here.
I think as we are not using mtime to show that the file has not been copied
and been skipped ,instead we are doing the same with the debug message
(qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
segment can be removed.
Yes, removing this one makes sense.
And, to keep it short: applied.
--
Michael
On Sat, Oct 25, 2025 at 6:46 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
- * These files exist on the source and the target services, so
they
should
+ * These files exist on the source and the target servers, sothey
should
Not sure what my fingers were doing here.
I think as we are not using mtime to show that the file has not been
copied
and been skipped ,instead we are doing the same with the debug message
(qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
segment can be removed.Yes, removing this one makes sense.
And, to keep it short: applied.
Thanks Michael.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
On Thu, Oct 23, 2025 at 08:40:14AM -0400, Robert Haas wrote:
While I'm not against cross-checking against the control file, this
sounds like an imaginary scenario to me. That is, it would only happen
if somebody maliciously modified the contents of the data directory by
hand with the express goal of breaking the tool. But we fundamentally
cannot defend against a malicious user whose express goal is to break
the tool, and I do not see any compelling reason to expend energy on
it even in cases like this where we could theoretically detect it
without much effort. If we go down that path, we'll end up not only
complicating the code, but also obscuring our own goals: it will look
like we've either done too much sanity checking (because we will have
added checks that are unnecessary with a non-malicious user) or too
little (because we will not have caught all the things a malicious
user might do).
I was thinking about this argument over the weekend, and I am
wondering if we could not do better here to detect if a file should be
copied or not. What if we included a checksum of each file if both
exist on the target and source, and just not copy them if the
checksums match? You cannot do that for relation files when the
source is online, of course, but for files like the oldest segments
before the divergence point, that's better than checking the size,
still more expensive due to the cost of the checksum computation.
And there is a sha256() available at SQL level.
Just throwing one idea in the bucket of ideas. That may not be worth
the extra cost here, of course, but attaching a checksum to
file_entry_t is not what I would qualify as an invasive change.
--
Michael
On Tue, Oct 28, 2025 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote:
I was thinking about this argument over the weekend, and I am
wondering if we could not do better here to detect if a file should be
copied or not. What if we included a checksum of each file if both
exist on the target and source, and just not copy them if the
checksums match?
Well, that would require reading the entire file on both sides to
compute the checksum, which sounds pretty expensive. I mean, a copy
would only be a read on one side and a write on the other. Even
granting that writes are more expensive than reads, a read of both
sides would still be a substantial percentage of the total cost, I
think.
Also, I don't think we really want to reinvent a worse version of
rsync. If you want to use checksums or file timestamps to decide what
to copy, there are already good tools for that which probably handle
that task better than our code ever will. What we can bring to the
table is PG-specific logic, where we're able to reason about the
behavior of PG in a way that a general-purpose tool can't. That's why
for example we use the WAL to decide what data blocks need to be
copied, rather than checksums -- it's an optimization that rsync can't
do, and we can. The rule implemented here is similar: rsync can't know
that WAL from before the divergence point should be the same on both
sides, but we can.
Now, of course, if in a specific situation the assumptions on which
pg_rewind relies are not valid, e.g. because manual data directory
modification has occurred, then pg_rewind should not be used. And if
on the other hand we find some flaw that will keep pg_rewind from
delivering correct results even when nothing strange has happened,
then that's a bug or a design problem that we need to fix. But if we
just start second-guessing ourselves by adding overhead to protect
against can't-happen scenarios, we'll end up making pg_rewind useless.
--
Robert Haas
EDB: http://www.enterprisedb.com