[bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary
Hello,
Our customer hit another bug of pg_rewind with PG 9.5. The attached patch fixes this.
PROBLEM
========================================
After a long run of successful pg_rewind, the synchronized standby could not catch up the primary forever, emitting the following message repeatedly:
LOG: XX000: could not read from log segment 000000060000028A00000031, offset 16384: No error
CAUSE
========================================
If the primary removes WAL files that pg_rewind is going to get, pg_rewind leaves 0-byte WAL files in the target directory here:
[libpq_fetch.c]
case FILE_ACTION_COPY:
/* Truncate the old file out of the way, if any */
open_target_file(entry->path, true);
fetch_file_range(entry->path, 0, entry->newsize);
break;
pg_rewind completes successfully, create recovery.conf, and then start the standby in the target cluster. walreceiver receives WAL records and write them to the 0-byte WAL files. Finally, xlog reader complains that he cannot read a WAL page.
FIX
========================================
pg_rewind deletes the file when it finds that the primary has deleted it.
OTHER THOUGHTS
========================================
BTW, should pg_rewind really copy WAL files from the primary? If the sole purpose of pg_rewind is to recover an instance to use as a standby, can pg_rewind just remove all WAL files in the target directory, because the standby can get WAL files from the primary and/or archive?
Related to this, shouldn't pg_rewind avoid copying more files and directories like pg_basebackup? Currently, pg_rewind doesn't copy postmaster.pid, postmaster.opts, and temporary files/directories (pg_sql_tmp/).
Regards
Takayuki Tsunakawa
Attachments:
pg_rewind_corrupt_wal.patchapplication/octet-stream; name=pg_rewind_corrupt_wal.patchDownload
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 705383d..1b53cc1 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -29,7 +29,6 @@
static int dstfd = -1;
static char dstpath[MAXPGPATH] = "";
-static void remove_target_file(const char *path);
static void create_target_dir(const char *path);
static void remove_target_dir(const char *path);
static void create_target_symlink(const char *path, const char *link);
@@ -134,7 +133,7 @@ remove_target(file_entry_t *entry)
break;
case FILE_TYPE_REGULAR:
- remove_target_file(entry->path);
+ remove_target_file(entry->path, false);
break;
case FILE_TYPE_SYMLINK:
@@ -165,8 +164,8 @@ create_target(file_entry_t *entry)
}
}
-static void
-remove_target_file(const char *path)
+void
+remove_target_file(const char *path, bool ignore_error)
{
char dstpath[MAXPGPATH];
@@ -174,7 +173,7 @@ remove_target_file(const char *path)
return;
snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
- if (unlink(dstpath) != 0)
+ if (unlink(dstpath) != 0 && !ignore_error)
pg_fatal("could not remove file \"%s\": %s\n",
dstpath, strerror(errno));
}
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index be580ee..d415aa2 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -15,6 +15,7 @@
extern void open_target_file(const char *path, bool trunc);
extern void write_target_range(char *buf, off_t begin, size_t size);
extern void close_target_file(void);
+extern void remove_target_file(const char *path, bool ignore_error);
extern void truncate_target_file(const char *path, off_t newsize);
extern void create_target(file_entry_t *t);
extern void remove_target(file_entry_t *t);
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index d1726d1..f8bb8dc 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -313,6 +313,7 @@ receiveFileChunks(const char *sql)
pg_log(PG_DEBUG,
"received null value for chunk for file \"%s\", file has been deleted\n",
filename);
+ remove_target_file(filename, true);
pg_free(filename);
PQclear(res);
continue;
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
BTW, should pg_rewind really copy WAL files from the primary? If the
sole purpose of pg_rewind is to recover an instance to use as a
standby, can pg_rewind just remove all WAL files in the target
directory, because the standby can get WAL files from the primary
and/or archive?
Yes, it should not copy those WAL files. Most of the time they are going
to be meaningless. See this recent thread:
/messages/by-id/20180126023609.GH17847@paquier.xyz
So I would rather go this way instead of having to worry about
manipulating those WAL segments as you do. Depending on the needs, I
think that even a backpatch could be considered.
Related to this, shouldn't pg_rewind avoid copying more files and
directories like pg_basebackup? Currently, pg_rewind doesn't copy
postmaster.pid, postmaster.opts, and temporary files/directories
(pg_sql_tmp/).
Yes, it should not copy those files. I have a patch in the current CF
to do that:
https://commitfest.postgresql.org/17/1507/
--
Michael
From: Michael Paquier [mailto:michael@paquier.xyz]
Yes, it should not copy those WAL files. Most of the time they are going
to be meaningless. See this recent thread:
/messages/by-id/20180126023609.GH17847@paquier
.xyz
So I would rather go this way instead of having to worry about manipulating
those WAL segments as you do. Depending on the needs, I think that even
a backpatch could be considered.
Thank you for information. I didn't notice those activities going around pg_rewind.
It's a regret that Chen's patch, which limits the WAL to be copied, is not committed yet. It looks good to be ready for committer.
Related to this, shouldn't pg_rewind avoid copying more files and
directories like pg_basebackup? Currently, pg_rewind doesn't copy
postmaster.pid, postmaster.opts, and temporary files/directories
(pg_sql_tmp/).Yes, it should not copy those files. I have a patch in the current CF to
do that:
https://commitfest.postgresql.org/17/1507/
Wow, what a great patch. I think I should look at it. But I'm afraid it won't be backpatched because it's big...
Even with your patch and Chen's one, my small patch is probably necessary to avoid leaving 0-byte or half-baked files. I'm not sure whether those strangely sized files would cause actual trouble, but maybe it would be healthy to try to clean things up as much as possible. (files in pg_twophase/ might emit WARNING messages, garbage server log files might make the DBA worried, etc.; yes, these may be just FUD.)
Regards
Takayuki Tsunakawa
On Thu, Mar 01, 2018 at 07:49:06AM +0000, Tsunakawa, Takayuki wrote:
It's a regret that Chen's patch, which limits the WAL to be copied, is
not committed yet. It looks good to be ready for committer.
The message I sent provides reasons about why it should not be
integrated. Particularly since the prior last checkpoint has been
removed in v11, there is always going to be a whole in WAL segments as
you need to create a checkpoint on the ex-standby after it has been
promoted so as its control file data is correctly reflected on disk.
Related to this, shouldn't pg_rewind avoid copying more files and
directories like pg_basebackup? Currently, pg_rewind doesn't copy
postmaster.pid, postmaster.opts, and temporary files/directories
(pg_sql_tmp/).Yes, it should not copy those files. I have a patch in the current CF to
do that:
https://commitfest.postgresql.org/17/1507/Wow, what a great patch. I think I should look at it. But I'm afraid
it won't be backpatched because it's big...
That's a new feature. This won't get backpatch'ed anyway.
Even with your patch and Chen's one, my small patch is probably
necessary to avoid leaving 0-byte or half-baked files. I'm not sure
whether those strangely sized files would cause actual trouble, but
maybe it would be healthy to try to clean things up as much as
possible. (files in pg_twophase/ might emit WARNING messages, garbage
server log files might make the DBA worried, etc.; yes, these may be
just FUD.)
Yeah, I'd like to double-check what you are proposing here anyway.
Sorry but I do not have an opinion about what you have sent yet :(
The only thing I am sure of though is that for HEAD not copying files
from pg_wal from the origin is the way to do it. For back-branches
that's another story.
--
Michael
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
Our customer hit another bug of pg_rewind with PG 9.5. The attached
patch fixes this.
Sorry for my late reply. It took me unfortunately some time before
coming back to this patch.
PROBLEM
========================================After a long run of successful pg_rewind, the synchronized standby
could not catch up the primary forever, emitting the following message
repeatedly:LOG: XX000: could not read from log segment 000000060000028A00000031,
offset 16384: No error
Oops.
CAUSE
========================================If the primary removes WAL files that pg_rewind is going to get,
pg_rewind leaves 0-byte WAL files in the target directory here:[libpq_fetch.c]
case FILE_ACTION_COPY:
/* Truncate the old file out of the way, if any */
open_target_file(entry->path, true);
fetch_file_range(entry->path, 0, entry->newsize);
break;pg_rewind completes successfully, create recovery.conf, and then start
the standby in the target cluster. walreceiver receives WAL records
and write them to the 0-byte WAL files. Finally, xlog reader
complains that he cannot read a WAL page.
This part qualifies as a data corruption bug as some records are lost by
the backend.
FIX
========================================pg_rewind deletes the file when it finds that the primary has deleted
it.
The concept looks right to me. I think that this does not apply only to
WAL segments though. You could have a temporary WAL segment that has
been included in the chunk table, which is even more volatile, or a
multixact file, a snapshot file, etc. In short anything which is not a
relation file and could disappear between the moment the file map is
built and the data is fetched from the source server.
@@ -174,7 +173,7 @@ remove_target_file(const char *path)
return;
snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
- if (unlink(dstpath) != 0)
+ if (unlink(dstpath) != 0 && !ignore_error)
I see. So the only reason why this flag exists is that if a file is
large enough so as it is split into multiple chunks, then the first
unlink will work but not the successive ones. One chunk can be at most
1 million bytes, which is why it is essential for WAL segments. Instead
of ignoring *all* errors, let's ignore only ENOENT and rename
ignore_error to missing_ok as well.
You need to update the comment in receiveFileChunks where an entry gets
deleted with basically what I mentioned above, say:
"If a file has been deleted on the source, remove it on the target as
well. Note that multiple unlink() calls may happen on the same file if
multiple data chunks are associated with it, hence ignore
unconditionally anything missing. If this file is not a relation data
file, then it has been already truncated when creating the file chunk
list at hte previous execution of the filemap."
Adding also a comment on top of remove_target_file to explain what
missing_ok does would be nice to keep track of what the code should do.
The portion about whether data from pg_wal should be transfered or not
is really an improvement that could come in a future version, and what
you are proposing here is more general, so I think that we should go
ahead with it.
--
Michael
From: Michael Paquier [mailto:michael@paquier.xyz]
I see. So the only reason why this flag exists is that if a file is large
enough so as it is split into multiple chunks, then the first unlink will
work but not the successive ones. One chunk can be at most
1 million bytes, which is why it is essential for WAL segments. Instead
of ignoring *all* errors, let's ignore only ENOENT and rename ignore_error
to missing_ok as well.You need to update the comment in receiveFileChunks where an entry gets
deleted with basically what I mentioned above, say:
"If a file has been deleted on the source, remove it on the target as well.
Note that multiple unlink() calls may happen on the same file if multiple
data chunks are associated with it, hence ignore unconditionally anything
missing. If this file is not a relation data file, then it has been already
truncated when creating the file chunk list at hte previous execution of
the filemap."Adding also a comment on top of remove_target_file to explain what
missing_ok does would be nice to keep track of what the code should do.
Thanks for reviewing. All done.
Regards
Takayuki Tsunakawa
Attachments:
pg_rewind_corrupt_wal_v2.patchapplication/octet-stream; name=pg_rewind_corrupt_wal_v2.patchDownload
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 705383d..3348c4a 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -29,7 +29,6 @@
static int dstfd = -1;
static char dstpath[MAXPGPATH] = "";
-static void remove_target_file(const char *path);
static void create_target_dir(const char *path);
static void remove_target_dir(const char *path);
static void create_target_symlink(const char *path, const char *link);
@@ -134,7 +133,7 @@ remove_target(file_entry_t *entry)
break;
case FILE_TYPE_REGULAR:
- remove_target_file(entry->path);
+ remove_target_file(entry->path, false);
break;
case FILE_TYPE_SYMLINK:
@@ -165,8 +164,11 @@ create_target(file_entry_t *entry)
}
}
-static void
-remove_target_file(const char *path)
+/*
+ * Remove a file, ignoring an error if it doesn't exist and missing_ok is true.
+ */
+void
+remove_target_file(const char *path, bool missing_ok)
{
char dstpath[MAXPGPATH];
@@ -174,7 +176,8 @@ remove_target_file(const char *path)
return;
snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
- if (unlink(dstpath) != 0)
+ if (unlink(dstpath) != 0 &&
+ !(errno == ENOENT && missing_ok))
pg_fatal("could not remove file \"%s\": %s\n",
dstpath, strerror(errno));
}
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index be580ee..9d26cf4 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -15,6 +15,7 @@
extern void open_target_file(const char *path, bool trunc);
extern void write_target_range(char *buf, off_t begin, size_t size);
extern void close_target_file(void);
+extern void remove_target_file(const char *path, bool missing_ok);
extern void truncate_target_file(const char *path, off_t newsize);
extern void create_target(file_entry_t *t);
extern void remove_target(file_entry_t *t);
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index d1726d1..114caf3 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -304,15 +304,19 @@ receiveFileChunks(const char *sql)
chunk = PQgetvalue(res, 0, 2);
/*
- * It's possible that the file was deleted on remote side after we
- * created the file map. In this case simply ignore it, as if it was
- * not there in the first place, and move on.
+ * If a file has been deleted on the source, remove it on the
+ * target as well. Note that multiple unlink() calls may happen
+ * on the same file if multiple data chunks are associated with
+ * it, hence ignore unconditionally anything missing. If this
+ * file is not a relation data file, then it has been already
+ * truncated when creating the file chunk list at the previous execution of the filemap.
*/
if (PQgetisnull(res, 0, 2))
{
pg_log(PG_DEBUG,
"received null value for chunk for file \"%s\", file has been deleted\n",
filename);
+ remove_target_file(filename, true);
pg_free(filename);
PQclear(res);
continue;
On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
Thanks for reviewing. All done.
Thanks. Please be careful with the indentation of the patch. Attached
is a slightly-updated version with a small modification in
remove_target_file to make the code cleaner, a proposal of commit
message and pgindent applied on the code. I am switching that as ready
for committer.
--
Michael
Attachments:
0001-Fix-handling-of-removed-files-on-target-server-with-.patchtext/x-diff; charset=us-asciiDownload
From 62950c615d87e3e58998f295ce446eb5c80707e4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sun, 11 Mar 2018 22:49:55 +0900
Subject: [PATCH] Fix handling of removed files on target server with pg_rewind
After processing the filemap to build the list of chunks that will be
fetched from the source to rewing the target server, it is possible that
a file which was previously processed is removed from the source. A
simple example of such an occurence is a WAL segment which gets recycled
on the target in-between. When the filemap is processed, files not
categorized as relation files are first truncated to prepare for its
full copy of which is going to be taken from the source, divided into a
set of junks. However, for a recycled WAL segment, this would result in
a segment which has a zero-byte size. With such an empty file,
post-rewind recovery thinks that records are saved but they are actually
not because of the truncation which happened when processing the
filemap, resulting in data loss.
In order to fix the problem, make sure that files which are found as
removed on the source when receiving chunks of them are as well deleted
on the target server for consistency. This ensures that no empty junk
files remain. If a file has a size so large that it needs more than one
chunk to be fully copied, it could be possible that the deletion is
attempted more than once. It could be possible as well that the file
has gone away after already copying some chunks on the target. For
those reasons, the file removals done when processing the file chunks
are lossy, and ignore missing entries.
Patch and report from Tsunakawa Takayuki, reviewed by Michael Paquier.
Discussion: https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8DAAA2%40G01JPEXMBYT05
---
src/bin/pg_rewind/file_ops.c | 16 ++++++++++++----
src/bin/pg_rewind/file_ops.h | 1 +
src/bin/pg_rewind/libpq_fetch.c | 10 +++++++---
3 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 705383d184..f491ed7f5c 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -29,7 +29,6 @@
static int dstfd = -1;
static char dstpath[MAXPGPATH] = "";
-static void remove_target_file(const char *path);
static void create_target_dir(const char *path);
static void remove_target_dir(const char *path);
static void create_target_symlink(const char *path, const char *link);
@@ -134,7 +133,7 @@ remove_target(file_entry_t *entry)
break;
case FILE_TYPE_REGULAR:
- remove_target_file(entry->path);
+ remove_target_file(entry->path, false);
break;
case FILE_TYPE_SYMLINK:
@@ -165,8 +164,12 @@ create_target(file_entry_t *entry)
}
}
-static void
-remove_target_file(const char *path)
+/*
+ * Remove a file from target data directory. If missing_ok is true, it
+ * is fine for the target file to not exist.
+ */
+void
+remove_target_file(const char *path, bool missing_ok)
{
char dstpath[MAXPGPATH];
@@ -175,8 +178,13 @@ remove_target_file(const char *path)
snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
if (unlink(dstpath) != 0)
+ {
+ if (errno == ENOENT && missing_ok)
+ return;
+
pg_fatal("could not remove file \"%s\": %s\n",
dstpath, strerror(errno));
+ }
}
void
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index be580ee4db..9d26cf4f77 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -15,6 +15,7 @@
extern void open_target_file(const char *path, bool trunc);
extern void write_target_range(char *buf, off_t begin, size_t size);
extern void close_target_file(void);
+extern void remove_target_file(const char *path, bool missing_ok);
extern void truncate_target_file(const char *path, off_t newsize);
extern void create_target(file_entry_t *t);
extern void remove_target(file_entry_t *t);
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 8f8d504455..5914b15017 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -311,15 +311,19 @@ receiveFileChunks(const char *sql)
chunk = PQgetvalue(res, 0, 2);
/*
- * It's possible that the file was deleted on remote side after we
- * created the file map. In this case simply ignore it, as if it was
- * not there in the first place, and move on.
+ * If a file has been deleted on the source, remove it on the target
+ * as well. Note that multiple unlink() calls may happen on the same
+ * file if multiple data chunks are associated with it, hence ignore
+ * unconditionally anything missing. If this file is not a relation
+ * data file, then it has been already truncated when creating the
+ * file chunk list at the previous execution of the filemap.
*/
if (PQgetisnull(res, 0, 2))
{
pg_log(PG_DEBUG,
"received null value for chunk for file \"%s\", file has been deleted\n",
filename);
+ remove_target_file(filename, true);
pg_free(filename);
PQclear(res);
continue;
--
2.16.2
On Sun, Mar 11, 2018 at 11:04:01PM +0900, Michael Paquier wrote:
On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
Thanks for reviewing. All done.
Thanks. Please be careful with the indentation of the patch. Attached
is a slightly-updated version with a small modification in
remove_target_file to make the code cleaner, a proposal of commit
message and pgindent applied on the code. I am switching that as ready
for committer.
--
Michael
From 62950c615d87e3e58998f295ce446eb5c80707e4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sun, 11 Mar 2018 22:49:55 +0900
Subject: [PATCH] Fix handling of removed files on target server with pg_rewindAfter processing the filemap to build the list of chunks that will be
fetched from the source to rewing the target server, it is possible that
a file which was previously processed is removed from the source. A
simple example of such an occurence is a WAL segment which gets recycled
on the target in-between. When the filemap is processed, files not
categorized as relation files are first truncated to prepare for its
full copy of which is going to be taken from the source, divided into a
set of junks.
Hopefully a set of CHUNKS not JUNKS ?
Justin
On Sun, Mar 11, 2018 at 09:56:33AM -0500, Justin Pryzby wrote:
Hopefully a set of CHUNKS not JUNKS ?
Yes. (laugh)
--
Michael
On Sun, Mar 11, 2018 at 11:04 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
Thanks for reviewing. All done.
Thanks. Please be careful with the indentation of the patch. Attached
is a slightly-updated version with a small modification in
remove_target_file to make the code cleaner, a proposal of commit
message and pgindent applied on the code. I am switching that as ready
for committer.
The patch looks good to me! Barring objection, I will commit it
and back-patch to 9.5 where pg_rewind was added.
Regards,
--
Fujii Masao
On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
The patch looks good to me! Barring objection, I will commit it
and back-patch to 9.5 where pg_rewind was added.
Thanks in advance! I just eyeballed the patch again and I don't see
issues with what's used here. The thing should apply smoothly back to
9.5, of course let me know if you need help or an extra pair of eyes.
--
Michael
On Wed, Mar 28, 2018 at 11:24 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
The patch looks good to me! Barring objection, I will commit it
and back-patch to 9.5 where pg_rewind was added.Thanks in advance! I just eyeballed the patch again and I don't see
issues with what's used here. The thing should apply smoothly back to
9.5, of course let me know if you need help or an extra pair of eyes.
Pushed. Thanks!
Regards,
--
Fujii Masao