If a file is modified and becomes larger in the source system while
pg_rewind is running, pg_rewind can leave behind a partial copy of the file.
That's by design, and it's OK for relation files because they're
replayed from WAL. But it can cause trouble for configuration files.
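The failure mode is easy to reproduce outside pg_rewind. The following standalone C sketch (not pg_rewind code; the copy_prefix helper and file names are made up for illustration) copies only as many bytes as the file had when its size was first observed, so a file that grows in between ends up truncated at the target:

```c
#include <stdio.h>

/* Copy only 'len' bytes of 'src' into 'dst', mimicking a fetch that
 * trusts the file size recorded during an earlier directory scan.
 * Returns the number of bytes copied, or -1 on error. */
static long copy_prefix(const char *src, const char *dst, long len)
{
    FILE *in = fopen(src, "rb");
    FILE *out = in ? fopen(dst, "wb") : NULL;
    char buf[8192];
    long copied = 0;

    if (in == NULL || out == NULL)
    {
        if (in)
            fclose(in);
        return -1;
    }

    while (copied < len)
    {
        long want = len - copied;
        size_t chunk = want < (long) sizeof(buf) ? (size_t) want : sizeof(buf);
        size_t n = fread(buf, 1, chunk, in);

        if (n == 0)
            break;              /* source shorter than expected */
        fwrite(buf, 1, n, out);
        copied += (long) n;
    }

    fclose(in);
    fclose(out);
    return copied;
}
```

If the source gains a line between the scan and the copy, the target file stops mid-content, which is exactly the unparseable postgresql.auto.conf seen above.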
I ran into this while playing with pg_auto_failover. After failover,
pg_auto_failover would often launch pg_rewind, and run ALTER SYSTEM on
the primary while pg_rewind was running. The resulting rewound system
would fail to start up:
Nov 13 09:24:42 pg-node-a pg_autoctl[2217]: 09:24:42 2220 ERROR
2020-11-13 09:24:32.547 GMT [2246] LOG: syntax error in file
"/data/pgdata/postgresql.auto.conf" line 4, near token "'"
Nov 13 09:24:42 pg-node-a pg_autoctl[2217]: 09:24:42 2220 ERROR
2020-11-13 09:24:32.547 GMT [2246] FATAL: configuration file
"postgresql.auto.conf" contains errors
Attached is a patch to mitigate that. It changes pg_rewind so that when
it copies a whole file, it ignores the original file size. It's not a
complete cure: it still believes the original size for files larger than
1 MB. That limit was just expedient given the way the chunking logic in
libpq_source.c works, but it should be enough for configuration files.
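The sizing rule can be sketched like this (CHUNK_LIMIT is a stand-in for pg_rewind's MAX_CHUNK_SIZE; this mirrors the Max(len, MAX_CHUNK_SIZE) request in the patch rather than being the actual pg_rewind code):

```c
#define CHUNK_LIMIT (1024 * 1024)   /* stand-in for MAX_CHUNK_SIZE */

/* For small files, request a full-sized chunk so that growth after the
 * directory scan is still captured; for larger files, fetch only up to
 * the size observed during the scan. */
static long fetch_request_len(long observed_len)
{
    return observed_len > CHUNK_LIMIT ? observed_len : CHUNK_LIMIT;
}
```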
There's another race condition that this doesn't try to fix: If a file
is modified while it's being copied, you can have a torn file with one
half of the file from the old version, and one half from the new. That's
a much more narrow window, though, and pg_basebackup has the same problem.
- Heikki
Attachments:
0001-pg_rewind-Fetch-small-files-according-to-new-size.patch (text/x-patch)
From f0207e355c6be2d710d2ac11e4895795c3471ef3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 13 Nov 2020 11:04:29 +0200
Subject: [PATCH 1/1] pg_rewind: Fetch small files according to new size.
There's a race condition if a file changes in the source system after we
have collected the file list. If the file becomes larger, we only fetched
up to its original size. That can easily result in a truncated file.
That's not a problem for relation files, files in pg_xact, etc. because
any actions on them will be replayed from the WAL. However, configuration
files are affected.
This commit mitigates the race condition by fetching small files in
whole, even if they have grown. This is not a full fix: we still believe
the original file size for files larger than 1 MB. That should be enough
for configuration files, and doing more than that would require bigger
changes to the chunking logic in libpq_source.c.
That mitigates the race condition if the file is modified between the
original scan of files and copying the file, but there's still a race
condition if a file is changed while it's being copied. That's a much
smaller window, though, and pg_basebackup has the same issue.
I ran into this while playing with pg_auto_failover, which frequently
uses ALTER SYSTEM, which updates postgresql.auto.conf. Often, pg_rewind
would fail, because the postgresql.auto.conf file changed concurrently
and a partial version of it was copied to the target. The partial
file would fail to parse, preventing the server from starting up.
---
src/bin/pg_rewind/libpq_source.c | 32 +++++++++++++
src/bin/pg_rewind/local_source.c | 76 +++++++++++++++++++++++++++----
src/bin/pg_rewind/pg_rewind.c | 5 +-
src/bin/pg_rewind/rewind_source.h | 13 ++++++
4 files changed, 112 insertions(+), 14 deletions(-)
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 47beba277a4..f544d9828a2 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -63,6 +63,7 @@ static void process_queued_fetch_requests(libpq_source *src);
/* public interface functions */
static void libpq_traverse_files(rewind_source *source,
process_file_callback_t callback);
+static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
static void libpq_queue_fetch_range(rewind_source *source, const char *path,
off_t off, size_t len);
static void libpq_finish_fetch(rewind_source *source);
@@ -88,6 +89,7 @@ init_libpq_source(PGconn *conn)
src->common.traverse_files = libpq_traverse_files;
src->common.fetch_file = libpq_fetch_file;
+ src->common.queue_fetch_file = libpq_queue_fetch_file;
src->common.queue_fetch_range = libpq_queue_fetch_range;
src->common.finish_fetch = libpq_finish_fetch;
src->common.get_current_wal_insert_lsn = libpq_get_current_wal_insert_lsn;
@@ -307,6 +309,36 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
PQclear(res);
}
+/*
+ * Queue up a request to fetch a file from remote system.
+ */
+static void
+libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ /*
+ * Truncate the target file immediately, and queue a request to fetch it
+ * from the source. If the file is small, smaller than MAX_CHUNK_SIZE,
+ * request fetching a full-sized chunk anyway, so that if the file has
+ * become larger in the source system, after we scanned the source
+ * directory, we still fetch the whole file. This only works for files up
+ * to MAX_CHUNK_SIZE, but that's good enough for small configuration files
+ * and such that are changed every now and then, but not WAL-logged.
+ * For larger files, we fetch up to the original size.
+ *
+ * Even with that mechanism, there is an inherent race condition if the
+ * file is modified at the same instant that we're copying it, so that we
+ * might copy a torn version of the file with one half from the old
+ * version and another half from the new. But pg_basebackup has the same
+ * problem, and it hasn't been a problem in practice.
+ *
+ * It might seem more natural to truncate the file later, when we receive
+ * it from the source server, but then we'd need to track which
+ * fetch-requests are for a whole file.
+ */
+ open_target_file(path, true);
+ libpq_queue_fetch_range(source, path, 0, Max(len, MAX_CHUNK_SIZE));
+}
+
/*
* Queue up a request to fetch a piece of a file from remote system.
*/
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index fa1b6e80ec3..c1ad2c49c5f 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -29,8 +29,10 @@ static void local_traverse_files(rewind_source *source,
process_file_callback_t callback);
static char *local_fetch_file(rewind_source *source, const char *path,
size_t *filesize);
-static void local_fetch_file_range(rewind_source *source, const char *path,
- off_t off, size_t len);
+static void local_queue_fetch_file(rewind_source *source, const char *path,
+ size_t len);
+static void local_queue_fetch_range(rewind_source *source, const char *path,
+ off_t off, size_t len);
static void local_finish_fetch(rewind_source *source);
static void local_destroy(rewind_source *source);
@@ -43,7 +45,8 @@ init_local_source(const char *datadir)
src->common.traverse_files = local_traverse_files;
src->common.fetch_file = local_fetch_file;
- src->common.queue_fetch_range = local_fetch_file_range;
+ src->common.queue_fetch_file = local_queue_fetch_file;
+ src->common.queue_fetch_range = local_queue_fetch_range;
src->common.finish_fetch = local_finish_fetch;
src->common.get_current_wal_insert_lsn = NULL;
src->common.destroy = local_destroy;
@@ -65,12 +68,65 @@ local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
return slurpFile(((local_source *) source)->datadir, path, filesize);
}
+/*
+ * Copy a file from source to target.
+ *
+ * 'len' is the expected length of the file.
+ */
+static void
+local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ const char *datadir = ((local_source *) source)->datadir;
+ PGAlignedBlock buf;
+ char srcpath[MAXPGPATH];
+ int srcfd;
+ size_t written_len;
+
+ snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
+
+ /* Open source file for reading. */
+ srcfd = open(srcpath, O_RDONLY | PG_BINARY, 0);
+ if (srcfd < 0)
+ pg_fatal("could not open source file \"%s\": %m",
+ srcpath);
+
+ /* Truncate and open the target file for writing. */
+ open_target_file(path, true);
+
+ written_len = 0;
+ for (;;)
+ {
+ ssize_t read_len;
+
+ read_len = read(srcfd, buf.data, sizeof(buf));
+
+ if (read_len < 0)
+ pg_fatal("could not read file \"%s\": %m", srcpath);
+ else if (read_len == 0)
+ break; /* EOF reached */
+
+ write_target_range(buf.data, written_len, read_len);
+ written_len += read_len;
+ }
+
+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that we size of the file matches our earlier expectation.
+ */
+ if (written_len != len)
+ pg_fatal("size of source file \"%s\" changed concurrently: " UINT64_FORMAT " bytes expected, " UINT64_FORMAT " copied",
+ srcpath, len, written_len);
+
+ if (close(srcfd) != 0)
+ pg_fatal("could not close file \"%s\": %m", srcpath);
+}
+
/*
* Copy a file from source to target, starting at 'off', for 'len' bytes.
*/
static void
-local_fetch_file_range(rewind_source *source, const char *path, off_t off,
- size_t len)
+local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
+ size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
PGAlignedBlock buf;
@@ -94,14 +150,14 @@ local_fetch_file_range(rewind_source *source, const char *path, off_t off,
while (end - begin > 0)
{
ssize_t readlen;
- size_t len;
+ size_t thislen;
if (end - begin > sizeof(buf))
- len = sizeof(buf);
+ thislen = sizeof(buf);
else
- len = end - begin;
+ thislen = end - begin;
- readlen = read(srcfd, buf.data, len);
+ readlen = read(srcfd, buf.data, thislen);
if (readlen < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
@@ -120,7 +176,7 @@ static void
local_finish_fetch(rewind_source *source)
{
/*
- * Nothing to do, local_fetch_file_range() copies the ranges immediately.
+ * Nothing to do, local_queue_fetch_range() copies the ranges immediately.
*/
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 2bbf8e74385..fe828c9e840 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -525,10 +525,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
break;
case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
- source->queue_fetch_range(source, entry->path,
- 0, entry->source_size);
+ source->queue_fetch_file(source, entry->path, entry->source_size);
break;
case FILE_ACTION_TRUNCATE:
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index e87f239a47a..a8dc9fda0da 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -47,6 +47,19 @@ typedef struct rewind_source
void (*queue_fetch_range) (struct rewind_source *, const char *path,
off_t offset, size_t len);
+ /*
+ * Like queue_fetch_range(), but requests replacing the whole local file
+ * from the source system. 'len' is the expected length of the file,
+ * although when the source is a live server, the file may change
+ * concurrently. The implementation is not obliged to copy more than 'len'
+ * bytes, even if the file is larger. However, to avoid copying a
+ * truncated version of the file, which can cause trouble if e.g. a
+ * configuration file is modified concurrently, the implementation should
+ * try to copy the whole file, even if it's larger than expected.
+ */
+ void (*queue_fetch_file) (struct rewind_source *, const char *path,
+ size_t len);
+
/*
* Execute all requests queued up with queue_fetch_range().
*/
--
2.20.1
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
Hello
The patch seems to do as described and the regression and tap tests are passing
+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that we size of the file matches our earlier expectation.
+ */
Is this a tyoo?
thanks
Cary
On 16/12/2020 00:08, Cary Huang wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
Hello
The patch seems to do as described and the regression and tap tests are passing

+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that we size of the file matches our earlier expectation.
+ */

Is this a tyoo?
Yep, thanks! Attached is a new patch version, with that fixed and
rebased. No other changes.
- Heikki
Attachments:
v2-0001-pg_rewind-Fetch-small-files-according-to-new-size.patch (text/x-patch)
From 649ce2ffb7ef390e96dbde9bd7da27a8a3d330d4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 22 Jan 2021 15:16:41 +0200
Subject: [PATCH v2 1/1] pg_rewind: Fetch small files according to new size.
There's a race condition if a file changes in the source system after we
have collected the file list. If the file becomes larger, we only fetched
up to its original size. That can easily result in a truncated file.
That's not a problem for relation files, files in pg_xact, etc. because
any actions on them will be replayed from the WAL. However, configuration
files are affected.
This commit mitigates the race condition by fetching small files in
whole, even if they have grown. This is not a full fix: we still believe
the original file size for files larger than 1 MB. That should be enough
for configuration files, and doing more than that would require bigger
changes to the chunking logic in libpq_source.c.
That mitigates the race condition if the file is modified between the
original scan of files and copying the file, but there's still a race
condition if a file is changed while it's being copied. That's a much
smaller window, though, and pg_basebackup has the same issue.
I ran into this while playing with pg_auto_failover, which frequently
uses ALTER SYSTEM, which updates postgresql.auto.conf. Often, pg_rewind
would fail, because the postgresql.auto.conf file changed concurrently
and a partial version of it was copied to the target. The partial
file would fail to parse, preventing the server from starting up.
Reviewed-by: Cary Huang
Discussion: https://www.postgresql.org/message-id/f67feb24-5833-88cb-1020-19a4a2b83ac7%40iki.fi
---
src/bin/pg_rewind/libpq_source.c | 32 +++++++++++++
src/bin/pg_rewind/local_source.c | 76 +++++++++++++++++++++++++++----
src/bin/pg_rewind/pg_rewind.c | 5 +-
src/bin/pg_rewind/rewind_source.h | 13 ++++++
4 files changed, 112 insertions(+), 14 deletions(-)
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 86d2adcaee9..ff16add16f5 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -63,6 +63,7 @@ static void process_queued_fetch_requests(libpq_source *src);
/* public interface functions */
static void libpq_traverse_files(rewind_source *source,
process_file_callback_t callback);
+static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
static void libpq_queue_fetch_range(rewind_source *source, const char *path,
off_t off, size_t len);
static void libpq_finish_fetch(rewind_source *source);
@@ -88,6 +89,7 @@ init_libpq_source(PGconn *conn)
src->common.traverse_files = libpq_traverse_files;
src->common.fetch_file = libpq_fetch_file;
+ src->common.queue_fetch_file = libpq_queue_fetch_file;
src->common.queue_fetch_range = libpq_queue_fetch_range;
src->common.finish_fetch = libpq_finish_fetch;
src->common.get_current_wal_insert_lsn = libpq_get_current_wal_insert_lsn;
@@ -307,6 +309,36 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
PQclear(res);
}
+/*
+ * Queue up a request to fetch a file from remote system.
+ */
+static void
+libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ /*
+ * Truncate the target file immediately, and queue a request to fetch it
+ * from the source. If the file is small, smaller than MAX_CHUNK_SIZE,
+ * request fetching a full-sized chunk anyway, so that if the file has
+ * become larger in the source system, after we scanned the source
+ * directory, we still fetch the whole file. This only works for files up
+ * to MAX_CHUNK_SIZE, but that's good enough for small configuration files
+ * and such that are changed every now and then, but not WAL-logged.
+ * For larger files, we fetch up to the original size.
+ *
+ * Even with that mechanism, there is an inherent race condition if the
+ * file is modified at the same instant that we're copying it, so that we
+ * might copy a torn version of the file with one half from the old
+ * version and another half from the new. But pg_basebackup has the same
+ * problem, and it hasn't been a problem in practice.
+ *
+ * It might seem more natural to truncate the file later, when we receive
+ * it from the source server, but then we'd need to track which
+ * fetch-requests are for a whole file.
+ */
+ open_target_file(path, true);
+ libpq_queue_fetch_range(source, path, 0, Max(len, MAX_CHUNK_SIZE));
+}
+
/*
* Queue up a request to fetch a piece of a file from remote system.
*/
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 9c3491c3fba..1899d1cc4ae 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -29,8 +29,10 @@ static void local_traverse_files(rewind_source *source,
process_file_callback_t callback);
static char *local_fetch_file(rewind_source *source, const char *path,
size_t *filesize);
-static void local_fetch_file_range(rewind_source *source, const char *path,
- off_t off, size_t len);
+static void local_queue_fetch_file(rewind_source *source, const char *path,
+ size_t len);
+static void local_queue_fetch_range(rewind_source *source, const char *path,
+ off_t off, size_t len);
static void local_finish_fetch(rewind_source *source);
static void local_destroy(rewind_source *source);
@@ -43,7 +45,8 @@ init_local_source(const char *datadir)
src->common.traverse_files = local_traverse_files;
src->common.fetch_file = local_fetch_file;
- src->common.queue_fetch_range = local_fetch_file_range;
+ src->common.queue_fetch_file = local_queue_fetch_file;
+ src->common.queue_fetch_range = local_queue_fetch_range;
src->common.finish_fetch = local_finish_fetch;
src->common.get_current_wal_insert_lsn = NULL;
src->common.destroy = local_destroy;
@@ -65,12 +68,65 @@ local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
return slurpFile(((local_source *) source)->datadir, path, filesize);
}
+/*
+ * Copy a file from source to target.
+ *
+ * 'len' is the expected length of the file.
+ */
+static void
+local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ const char *datadir = ((local_source *) source)->datadir;
+ PGAlignedBlock buf;
+ char srcpath[MAXPGPATH];
+ int srcfd;
+ size_t written_len;
+
+ snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
+
+ /* Open source file for reading. */
+ srcfd = open(srcpath, O_RDONLY | PG_BINARY, 0);
+ if (srcfd < 0)
+ pg_fatal("could not open source file \"%s\": %m",
+ srcpath);
+
+ /* Truncate and open the target file for writing. */
+ open_target_file(path, true);
+
+ written_len = 0;
+ for (;;)
+ {
+ ssize_t read_len;
+
+ read_len = read(srcfd, buf.data, sizeof(buf));
+
+ if (read_len < 0)
+ pg_fatal("could not read file \"%s\": %m", srcpath);
+ else if (read_len == 0)
+ break; /* EOF reached */
+
+ write_target_range(buf.data, written_len, read_len);
+ written_len += read_len;
+ }
+
+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that the size of the file matches our earlier expectation.
+ */
+ if (written_len != len)
+ pg_fatal("size of source file \"%s\" changed concurrently: " UINT64_FORMAT " bytes expected, " UINT64_FORMAT " copied",
+ srcpath, len, written_len);
+
+ if (close(srcfd) != 0)
+ pg_fatal("could not close file \"%s\": %m", srcpath);
+}
+
/*
* Copy a file from source to target, starting at 'off', for 'len' bytes.
*/
static void
-local_fetch_file_range(rewind_source *source, const char *path, off_t off,
- size_t len)
+local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
+ size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
PGAlignedBlock buf;
@@ -94,14 +150,14 @@ local_fetch_file_range(rewind_source *source, const char *path, off_t off,
while (end - begin > 0)
{
ssize_t readlen;
- size_t len;
+ size_t thislen;
if (end - begin > sizeof(buf))
- len = sizeof(buf);
+ thislen = sizeof(buf);
else
- len = end - begin;
+ thislen = end - begin;
- readlen = read(srcfd, buf.data, len);
+ readlen = read(srcfd, buf.data, thislen);
if (readlen < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
@@ -120,7 +176,7 @@ static void
local_finish_fetch(rewind_source *source)
{
/*
- * Nothing to do, local_fetch_file_range() copies the ranges immediately.
+ * Nothing to do, local_queue_fetch_range() copies the ranges immediately.
*/
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 359a6a587cb..9030a1505e3 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -537,10 +537,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
break;
case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
- source->queue_fetch_range(source, entry->path,
- 0, entry->source_size);
+ source->queue_fetch_file(source, entry->path, entry->source_size);
break;
case FILE_ACTION_TRUNCATE:
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 2da92dbff94..799b7c120ea 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -47,6 +47,19 @@ typedef struct rewind_source
void (*queue_fetch_range) (struct rewind_source *, const char *path,
off_t offset, size_t len);
+ /*
+ * Like queue_fetch_range(), but requests replacing the whole local file
+ * from the source system. 'len' is the expected length of the file,
+ * although when the source is a live server, the file may change
+ * concurrently. The implementation is not obliged to copy more than 'len'
+ * bytes, even if the file is larger. However, to avoid copying a
+ * truncated version of the file, which can cause trouble if e.g. a
+ * configuration file is modified concurrently, the implementation should
+ * try to copy the whole file, even if it's larger than expected.
+ */
+ void (*queue_fetch_file) (struct rewind_source *, const char *path,
+ size_t len);
+
/*
* Execute all requests queued up with queue_fetch_range().
*/
--
2.29.2
On 1/22/21 8:26 AM, Heikki Linnakangas wrote:
On 16/12/2020 00:08, Cary Huang wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
Hello
The patch seems to do as described and the regression and tap tests are passing

+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that we size of the file matches our earlier expectation.
+ */

Is this a tyoo?

Yep, thanks! Attached is a new patch version, with that fixed and
rebased. No other changes.
Cary, does this patch look ready to commit? If so, please change the
state in the CF entry to "Ready for Committer".
Regards,
--
-David
david@pgmasters.net
On 25 Mar 2021, at 14:32, David Steele <david@pgmasters.net> wrote:
On 1/22/21 8:26 AM, Heikki Linnakangas wrote:
On 16/12/2020 00:08, Cary Huang wrote:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
Hello
The patch seems to do as described and the regression and tap tests are passing

+ /*
+ * A local source is not expected to change while we're rewinding, so check
+ * that we size of the file matches our earlier expectation.
+ */

Is this a tyoo?

Yep, thanks! Attached is a new patch version, with that fixed and rebased. No other changes.
Cary, does this patch look ready to commit? If so, please change the state in the CF entry to "Ready for Committer".
Reading the patch I think it definitely qualifies for RFC state. Heikki, do
you have plans to address this patch in the ongoing CF?
--
Daniel Gustafsson https://vmware.com/
I took another look at this patch, and I think it's ready to go in; it clearly
fixes a bug that isn't too hard to hit in production settings. To ensure we
don't break this I've added a testcase which pipes the pg_rewind --verbose
output to a file it's asked to copy, which guarantees that the file is
growing in size during the operation without the need to synchronize two
processes with IPC::Run (it also passes on Windows in the CI setup).
One small comment on the patch:
+ snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
This should IMO check the return value of snprintf to ensure it wasn't
truncated. While the risk is exceedingly small, a truncated filename might
match another existing filename, with the error going uncaught. There is
another instance just like this one in open_target_file() to which I think we
should apply the same belts-and-suspenders treatment. I've fixed this in the
attached version which also have had a pg_indent run on top of a fresh rebase.
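The suggested check relies on snprintf's return value, which is the length the formatted string would have had without truncation. A minimal sketch of the idea (join_path is an illustrative helper, not the patch's code):

```c
#include <stdio.h>

/* Join 'dir' and 'name' into 'dst'. Returns 0 on success, -1 if the
 * result did not fit: snprintf reports the length the output *would*
 * have had, so a return value >= the buffer size signals truncation. */
static int join_path(char *dst, size_t dstlen, const char *dir, const char *name)
{
    int len = snprintf(dst, dstlen, "%s/%s", dir, name);

    if (len < 0 || (size_t) len >= dstlen)
        return -1;
    return 0;
}
```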
Thoughts on this version?
--
Daniel Gustafsson https://vmware.com/
Attachments:
v3-0001-pg_rewind-Fetch-small-files-according-to-new-size.patch (application/octet-stream)
From 3993107da2499322cccb20e7ec1c7a929c0632fb Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson <daniel@yesql.se>
Date: Fri, 1 Apr 2022 10:50:15 +0200
Subject: [PATCH v3] pg_rewind: Fetch small files according to new size.
There's a race condition if a file changes in the source system
after we have collected the file list. If the file becomes larger,
we only fetched up to its original size. That can easily result in
a truncated file. That's not a problem for relation files, files
in pg_xact, etc. because any actions on them will be replayed from
the WAL. However, configuration files are affected.
This commit mitigates the race condition by fetching small files in
whole, even if they have grown. A test is added in which an extra
file being copied is concurrently grown with the output of pg_rewind,
thus guaranteeing that it changes in size during the operation. This
is not a full fix: we still believe the original file size for files
larger than 1 MB. That should be enough for configuration files,
and doing more than that would require big changes to the chunking
logic in libpq_source.c.
This mitigates the race condition if the file is modified between
the original scan of files and copying the file, but there's still
a race condition if a file is changed while it's being copied.
That's a much smaller window, though, and pg_basebackup has the
same issue.
This race can be seen with pg_auto_failover, which frequently uses
ALTER SYSTEM, which updates postgresql.auto.conf. Often, pg_rewind
will fail, because the postgresql.auto.conf file changed concurrently
and a partial version of it was copied to the target. The partial
file would fail to parse, preventing the server from starting up.
While there, also add an error path in file_ops.c for the filename
buffer being too small, which mirrors the error path added in this
patch.
Author: Heikki Linnakangas
Reviewed-by: Cary Huang
Discussion: https://postgr.es/m/f67feb24-5833-88cb-1020-19a4a2b83ac7%40iki.fi
---
src/bin/pg_rewind/file_ops.c | 5 +-
src/bin/pg_rewind/libpq_source.c | 32 ++++++++++
src/bin/pg_rewind/local_source.c | 79 +++++++++++++++++++++---
src/bin/pg_rewind/pg_rewind.c | 5 +-
src/bin/pg_rewind/rewind_source.h | 13 ++++
src/bin/pg_rewind/t/009_growing_files.pl | 76 +++++++++++++++++++++++
6 files changed, 195 insertions(+), 15 deletions(-)
create mode 100644 src/bin/pg_rewind/t/009_growing_files.pl
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 6cb288f099..880835aa33 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -47,6 +47,7 @@ void
open_target_file(const char *path, bool trunc)
{
int mode;
+ int len;
if (dry_run)
return;
@@ -57,7 +58,9 @@ open_target_file(const char *path, bool trunc)
close_target_file();
- snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
+ len = snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
+ if (len >= sizeof(dstpath))
+ pg_fatal("filepath buffer too small"); /* shouldn't happen */
mode = O_WRONLY | O_CREAT | PG_BINARY;
if (trunc)
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 997d4e2b48..eaed5eed39 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -63,6 +63,7 @@ static void process_queued_fetch_requests(libpq_source *src);
/* public interface functions */
static void libpq_traverse_files(rewind_source *source,
process_file_callback_t callback);
+static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len);
static void libpq_queue_fetch_range(rewind_source *source, const char *path,
off_t off, size_t len);
static void libpq_finish_fetch(rewind_source *source);
@@ -88,6 +89,7 @@ init_libpq_source(PGconn *conn)
src->common.traverse_files = libpq_traverse_files;
src->common.fetch_file = libpq_fetch_file;
+ src->common.queue_fetch_file = libpq_queue_fetch_file;
src->common.queue_fetch_range = libpq_queue_fetch_range;
src->common.finish_fetch = libpq_finish_fetch;
src->common.get_current_wal_insert_lsn = libpq_get_current_wal_insert_lsn;
@@ -307,6 +309,36 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
PQclear(res);
}
+/*
+ * Queue up a request to fetch a file from remote system.
+ */
+static void
+libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ /*
+ * Truncate the target file immediately, and queue a request to fetch it
+ * from the source. If the file is small, smaller than MAX_CHUNK_SIZE,
+ * request fetching a full-sized chunk anyway, so that if the file has
+ * become larger in the source system, after we scanned the source
+ * directory, we still fetch the whole file. This only works for files up
+ * to MAX_CHUNK_SIZE, but that's good enough for small configuration files
+ * and such that are changed every now and then, but not WAL-logged. For
+ * larger files, we fetch up to the original size.
+ *
+ * Even with that mechanism, there is an inherent race condition if the
+ * file is modified at the same instant that we're copying it, so that we
+ * might copy a torn version of the file with one half from the old
+ * version and another half from the new. But pg_basebackup has the same
+ * problem, and it hasn't been a problem in practice.
+ *
+ * It might seem more natural to truncate the file later, when we receive
+ * it from the source server, but then we'd need to track which
+ * fetch-requests are for a whole file.
+ */
+ open_target_file(path, true);
+ libpq_queue_fetch_range(source, path, 0, Max(len, MAX_CHUNK_SIZE));
+}
+
/*
* Queue up a request to fetch a piece of a file from remote system.
*/
diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c
index 3406fc7037..a9213c834a 100644
--- a/src/bin/pg_rewind/local_source.c
+++ b/src/bin/pg_rewind/local_source.c
@@ -29,8 +29,10 @@ static void local_traverse_files(rewind_source *source,
process_file_callback_t callback);
static char *local_fetch_file(rewind_source *source, const char *path,
size_t *filesize);
-static void local_fetch_file_range(rewind_source *source, const char *path,
- off_t off, size_t len);
+static void local_queue_fetch_file(rewind_source *source, const char *path,
+ size_t len);
+static void local_queue_fetch_range(rewind_source *source, const char *path,
+ off_t off, size_t len);
static void local_finish_fetch(rewind_source *source);
static void local_destroy(rewind_source *source);
@@ -43,7 +45,8 @@ init_local_source(const char *datadir)
src->common.traverse_files = local_traverse_files;
src->common.fetch_file = local_fetch_file;
- src->common.queue_fetch_range = local_fetch_file_range;
+ src->common.queue_fetch_file = local_queue_fetch_file;
+ src->common.queue_fetch_range = local_queue_fetch_range;
src->common.finish_fetch = local_finish_fetch;
src->common.get_current_wal_insert_lsn = NULL;
src->common.destroy = local_destroy;
@@ -65,12 +68,68 @@ local_fetch_file(rewind_source *source, const char *path, size_t *filesize)
return slurpFile(((local_source *) source)->datadir, path, filesize);
}
+/*
+ * Copy a file from source to target.
+ *
+ * 'len' is the expected length of the file.
+ */
+static void
+local_queue_fetch_file(rewind_source *source, const char *path, size_t len)
+{
+ const char *datadir = ((local_source *) source)->datadir;
+ PGAlignedBlock buf;
+ char srcpath[MAXPGPATH];
+ int srcfd;
+ size_t written_len;
+ int pathlen;
+
+ pathlen = snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
+ if (pathlen >= sizeof(srcpath))
+ pg_fatal("filepath buffer too small"); /* shouldn't happen */
+
+ /* Open source file for reading */
+ srcfd = open(srcpath, O_RDONLY | PG_BINARY, 0);
+ if (srcfd < 0)
+ pg_fatal("could not open source file \"%s\": %m",
+ srcpath);
+
+ /* Truncate and open the target file for writing */
+ open_target_file(path, true);
+
+ written_len = 0;
+ for (;;)
+ {
+ ssize_t read_len;
+
+ read_len = read(srcfd, buf.data, sizeof(buf));
+
+ if (read_len < 0)
+ pg_fatal("could not read file \"%s\": %m", srcpath);
+ else if (read_len == 0)
+ break; /* EOF reached */
+
+ write_target_range(buf.data, written_len, read_len);
+ written_len += read_len;
+ }
+
+ /*
+ * A local source is not expected to change while we're rewinding, so
+ * check that the size of the file matches our earlier expectation.
+ */
+ if (written_len != len)
+ pg_fatal("size of source file \"%s\" changed concurrently: " UINT64_FORMAT " bytes expected, " UINT64_FORMAT " copied",
+ srcpath, len, written_len);
+
+ if (close(srcfd) != 0)
+ pg_fatal("could not close file \"%s\": %m", srcpath);
+}
+
/*
* Copy a file from source to target, starting at 'off', for 'len' bytes.
*/
static void
-local_fetch_file_range(rewind_source *source, const char *path, off_t off,
- size_t len)
+local_queue_fetch_range(rewind_source *source, const char *path, off_t off,
+ size_t len)
{
const char *datadir = ((local_source *) source)->datadir;
PGAlignedBlock buf;
@@ -94,14 +153,14 @@ local_fetch_file_range(rewind_source *source, const char *path, off_t off,
while (end - begin > 0)
{
ssize_t readlen;
- size_t len;
+ size_t thislen;
if (end - begin > sizeof(buf))
- len = sizeof(buf);
+ thislen = sizeof(buf);
else
- len = end - begin;
+ thislen = end - begin;
- readlen = read(srcfd, buf.data, len);
+ readlen = read(srcfd, buf.data, thislen);
if (readlen < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
@@ -120,7 +179,7 @@ static void
local_finish_fetch(rewind_source *source)
{
/*
- * Nothing to do, local_fetch_file_range() copies the ranges immediately.
+ * Nothing to do, local_queue_fetch_range() copies the ranges immediately.
*/
}
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b39b5c1aac..6cc44172fb 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -537,10 +537,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
break;
case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
- source->queue_fetch_range(source, entry->path,
- 0, entry->source_size);
+ source->queue_fetch_file(source, entry->path, entry->source_size);
break;
case FILE_ACTION_TRUNCATE:
diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h
index 78a12eb0f3..1310e86e75 100644
--- a/src/bin/pg_rewind/rewind_source.h
+++ b/src/bin/pg_rewind/rewind_source.h
@@ -47,6 +47,19 @@ typedef struct rewind_source
void (*queue_fetch_range) (struct rewind_source *, const char *path,
off_t offset, size_t len);
+ /*
+ * Like queue_fetch_range(), but requests replacing the whole local file
+ * from the source system. 'len' is the expected length of the file,
+ * although when the source is a live server, the file may change
+ * concurrently. The implementation is not obliged to copy more than 'len'
+ * bytes, even if the file is larger. However, to avoid copying a
+ * truncated version of the file, which can cause trouble if e.g. a
+ * configuration file is modified concurrently, the implementation should
+ * try to copy the whole file, even if it's larger than expected.
+ */
+ void (*queue_fetch_file) (struct rewind_source *, const char *path,
+ size_t len);
+
/*
* Execute all requests queued up with queue_fetch_range().
*/
diff --git a/src/bin/pg_rewind/t/009_growing_files.pl b/src/bin/pg_rewind/t/009_growing_files.pl
new file mode 100644
index 0000000000..752781e637
--- /dev/null
+++ b/src/bin/pg_rewind/t/009_growing_files.pl
@@ -0,0 +1,76 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use FindBin;
+use lib $FindBin::RealBin;
+
+use RewindTest;
+
+RewindTest::setup_cluster("local");
+RewindTest::start_primary();
+
+# Create a test table and insert a row in primary.
+primary_psql("CREATE TABLE tbl1 (d text)");
+primary_psql("INSERT INTO tbl1 VALUES ('in primary')");
+primary_psql("CHECKPOINT");
+
+RewindTest::create_standby("local");
+
+# Insert additional data on primary that will be replicated to standby
+primary_psql("INSERT INTO tbl1 values ('in primary, before promotion')");
+primary_psql('CHECKPOINT');
+
+RewindTest::promote_standby();
+
+# Insert a row in the old primary. This causes the primary and standby to have
+# "diverged"; it's no longer possible to just apply the standby's logs over
+# the primary directory - you need to rewind. Also insert a new row in the
+# standby, which won't be present in the old primary.
+primary_psql("INSERT INTO tbl1 VALUES ('in primary, after promotion')");
+standby_psql("INSERT INTO tbl1 VALUES ('in standby, after promotion')");
+
+# Stop the nodes before running pg_rewind
+$node_standby->stop;
+$node_primary->stop;
+
+my $primary_pgdata = $node_primary->data_dir;
+my $standby_pgdata = $node_standby->data_dir;
+
+# Add an extra file that we can tamper with without interfering with the data
+# directory data files.
+mkdir "$standby_pgdata/tst_both_dir";
+append_to_file "$standby_pgdata/tst_both_dir/file1", 'a';
+
+# Run pg_rewind and pipe the output from the run into the extra file we want
+# to copy. This ensures that the file is continuously growing during the
+# copy operation, so the result will be an error.
+my $ret = run_log(
+ [
+ 'pg_rewind', '--debug',
+ '--source-pgdata', $standby_pgdata,
+ '--target-pgdata', $primary_pgdata,
+ '--no-sync',
+ ],
+ '2>>', "$standby_pgdata/tst_both_dir/file1");
+ok(!$ret, 'Error out on copying growing file');
+
+# Ensure that the files are of different sizes; the final error message should
+# only be in one of them, guaranteeing that they differ.
+my $primary_size = -s "$primary_pgdata/tst_both_dir/file1";
+my $standby_size = -s "$standby_pgdata/tst_both_dir/file1";
+isnt($standby_size, $primary_size, "File sizes should differ");
+
+# Extract the last line from the verbose output as that should have the error
+# message for the unexpected file size
+my $last;
+open my $f, '<', "$standby_pgdata/tst_both_dir/file1";
+$last = $_ while (<$f>);
+close $f;
+like($last, qr/fatal: size of source file/, "Check error message");
+
+done_testing();
--
2.24.3 (Apple Git-128)
On 01/04/2022 12:00, Daniel Gustafsson wrote:
I took another look at this patch, and I think it's ready to go in; it clearly
fixes a bug that isn't too hard to hit in production settings. To ensure we
don't break this I've added a testcase which pipes the pg_rewind --verbose
output to a file it's asked to copy, which then guarantees that the file is
growing in size during the operation without needing to synchronize two
processes with IPC::Run (it also passes on Windows in the CI setup).

One small comment on the patch:
+ snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
This should IMO check the return value of snprintf to ensure it wasn't
truncated. While the risk is exceedingly small, a truncated filename might
match another existing filename without the error being caught. There is
another instance just like this one in open_target_file() to which I think we
should apply the same belts-and-suspenders treatment. I've fixed this in the
attached version, which has also had a pgindent run on top of a fresh rebase.
+ if (len >= sizeof(dstpath))
+ pg_fatal("filepath buffer too small"); /* shouldn't happen */
Makes sense. I would remove the "shouldn't happen"; it's not very hard
to make it happen, you just need a very long target datadir path. And
rephrase the error message as "datadir path too long".
One typo in the commit message: s/update/updates/.
Thanks!
- Heikki
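[Editorial aside: the check being discussed can be sketched in isolation. The helper below is hypothetical, not part of pg_rewind; it only illustrates the pattern of building "datadir/path" into a fixed-size buffer and treating snprintf truncation as an error, as Daniel suggests, rather than silently using a shortened path.]

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper illustrating the snprintf truncation check.
 * snprintf returns the number of bytes the full result would need
 * (excluding the terminating NUL), so a return value that is negative
 * or >= the buffer size means the output did not fit.
 */
static int
build_path(char *dst, size_t dstlen, const char *datadir, const char *path)
{
	int			len = snprintf(dst, dstlen, "%s/%s", datadir, path);

	if (len < 0 || (size_t) len >= dstlen)
		return -1;				/* the "datadir path too long" case */
	return 0;
}
```

With a 16-byte buffer, building "/data" + "conf" succeeds, while a longer datadir makes the helper return -1 instead of producing a truncated name that might collide with another existing file.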
On 1 Apr 2022, at 12:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
+ if (len >= sizeof(dstpath))
+ pg_fatal("filepath buffer too small"); /* shouldn't happen */

Makes sense. I would remove the "shouldn't happen"; it's not very hard to make it happen, you just need a very long target datadir path. And rephrase the error message as "datadir path too long".
Right, good point.
One typo in the commit message: s/update/updates/.
Will fix.
--
Daniel Gustafsson https://vmware.com/
On 01.04.22 11:00, Daniel Gustafsson wrote:
One small comment on the patch:
+ snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path);
This should IMO check the return value of snprintf to ensure it wasn't
truncated. While the risk is exceedingly small, a truncated filename might
match another existing filename without the error being caught. There is
another instance just like this one in open_target_file() to which I think we
should apply the same belts-and-suspenders treatment. I've fixed this in the
attached version, which has also had a pgindent run on top of a fresh rebase.
We use snprintf() like that countless times, and approximately none of
them check for overflow. So while you are right, this might not be the
place to start a new policy.
If you don't like this approach, use psprintf() perhaps.
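[Editorial aside: for contrast, psprintf() sidesteps the truncation question entirely by allocating a buffer sized to fit. A rough standalone imitation of that approach might look like the sketch below; the real psprintf() lives in PostgreSQL's src/common and uses palloc, whereas this stand-in uses plain malloc and returns NULL on failure.]

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Rough stand-in for the psprintf() approach: measure the formatted
 * length first, then allocate exactly enough space and format again.
 * Not PostgreSQL code; malloc/NULL replace palloc/error handling.
 */
static char *
psprintf_like(const char *fmt, ...)
{
	va_list		ap;
	int			len;
	char	   *result;

	/* First pass: ask vsnprintf how long the result would be. */
	va_start(ap, fmt);
	len = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (len < 0)
		return NULL;

	result = malloc((size_t) len + 1);
	if (result == NULL)
		return NULL;

	/* Second pass: format into the exactly-sized buffer. */
	va_start(ap, fmt);
	vsnprintf(result, (size_t) len + 1, fmt, ap);
	va_end(ap);
	return result;
}
```

Because the buffer is always sized from the measured length, there is no truncated-path case to check for at each call site.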
On 4 Apr 2022, at 15:08, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
We use snprintf() like that countless times, and approximately none of them check for overflow. So while you are right, this might not be the place to start a new policy.
Fair enough, I'll remove these hunks before committing and will look at a
bigger-picture patch to address these across the codebase later.
--
Daniel Gustafsson https://vmware.com/