BUG #14999: pg_rewind corrupts control file global/pg_control
The following bug has been logged on the website:
Bug reference: 14999
Logged by: Christian H.
Email address: office@tiptop-labs.com
PostgreSQL version: 10.1
Operating system: e.g. Debian Buster
Description:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.
A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .
You never know when github.com might be gone, which would cause the data
you are pointing to here to go away. What looks interesting is
pg_rewind.patch, which is what you are proposing as a fix, right? I am
attaching it here for PG archives.
- open_target_file(filename, false);
+ /* Trunc target file for action FILE_ACTION_COPY. */
+ open_target_file(filename, chunkoff == 0);
write_target_range(chunk, chunkoff, chunksize);
Hm. Let me think more about that as there are quite a few
distributions that link to SSL files that cannot be written, and
pg_rewind copies all configuration files in full.
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
If this is aimed at improving the performance of pg_rewind by making
chunk writes more sequential, it should be a different patch. I would
see value in that. If this is aimed to change the order of the files
to avoid the write corruption, this is wrong.
I would imagine that you could see something similar with the offline
mode, haven't tested yet though.
--
Michael
Attachments:
pg_rewind.patch (application/octet-stream)
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 0cdff55..74d8f1c7 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -351,7 +351,8 @@ receiveFileChunks(const char *sql)
pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %s, size %d\n",
filename, chunkoff_str, chunksize);
- open_target_file(filename, false);
+ /* Trunc target file for action FILE_ACTION_COPY. */
+ open_target_file(filename, chunkoff == 0);
write_target_range(chunk, chunkoff, chunksize);
@@ -478,8 +479,6 @@ libpq_executeFileMap(filemap_t *map)
break;
case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
fetch_file_range(entry->path, 0, entry->newsize);
break;
@@ -520,7 +519,8 @@ libpq_executeFileMap(filemap_t *map)
sql =
"SELECT path, begin,\n"
" pg_read_binary_file(path, begin, len, true) AS chunk\n"
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
receiveFileChunks(sql);
}
On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.
I can already confirm that this also occurs with PostgreSQL 9.6.
A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .
You never know when github.com might be gone, which would cause the data
you are pointing to here to go away. What looks interesting is
pg_rewind.patch, which is what you are proposing as a fix, right? I am
attaching it here for PG archives.
Yes, it's pg_rewind.patch. There is a bit of rationale in README, feel free to also attach here if/as needed.
- open_target_file(filename, false);
+ /* Trunc target file for action FILE_ACTION_COPY. */
+ open_target_file(filename, chunkoff == 0);
write_target_range(chunk, chunkoff, chunksize);
Hm. Let me think more about that as there are quite a few
distributions that link to SSL files that cannot be written, and
pg_rewind copies all configuration files in full.
Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
If this is aimed at improving the performance of pg_rewind by making
chunk writes more sequential, it should be a different patch. I would
see value in that. If this is aimed to change the order of the files
to avoid the write corruption, this is wrong.
It may be wrong then; from README: Truncation of the file was moved to the second loop. Truncation occurs there if chunks are written into files at offset 0. This is the case for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these chunks are processed first.
I would imagine that you could see something similar with the offline
mode, haven't tested yet though.
--
Michael
<pg_rewind.patch>
--
Christian
On Fri, Jan 5, 2018 at 12:36 PM, TipTop Labs <office@tiptop-labs.com> wrote:
Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.
I recall that Ubuntu and Debian do that. I don't use them but others
on this list will likely correct me. The point is that this is
authorized.
--
Michael
On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:
On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.
I can already confirm that this also occurs with PostgreSQL 9.6.
As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. For a relation file, it should be. There
are still two things I am not completely happy about:
- pg_control is considered a configuration file with this patch,
however we should fail if pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider adding special handling for
things like pg_control, filenode.map, etc. so that you get a hard failure
if those are not readable; they should enter the category of
FILE_ACTION_COPY_DATA. This needs a bit more thought in
process_source_file().
- open_target_file() manually resets errno. This is necessary as it
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
with this solution.
I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.
Thoughts?
--
Michael
Attachments:
rewind-readonly-fix-v1.patch (text/x-diff; charset=us-ascii)
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 04db409675..2696972421 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -154,9 +154,11 @@ recurse_dir(const char *datadir, const char *parentpath,
* Copy a file from source to target, between 'begin' and 'end' offsets.
*
* If 'trunc' is true, any existing file with the same name is truncated.
+ * If 'error_ok' is true, any failure when opening the file is ignored.
*/
static void
-rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
+rewind_copy_file_range(const char *path, off_t begin, off_t end,
+ bool trunc, bool error_ok)
{
char buf[BLCKSZ];
char srcpath[MAXPGPATH];
@@ -172,7 +174,8 @@ rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
if (lseek(srcfd, begin, SEEK_SET) == -1)
pg_fatal("could not seek in source file: %s\n", strerror(errno));
- open_target_file(path, trunc);
+ if (!open_target_file(path, trunc, error_ok ? PG_WARNING : PG_FATAL))
+ return;
while (end - begin > 0)
{
@@ -221,8 +224,14 @@ copy_executeFileMap(filemap_t *map)
/* ok, do nothing.. */
break;
- case FILE_ACTION_COPY:
- rewind_copy_file_range(entry->path, 0, entry->newsize, true);
+ case FILE_ACTION_COPY_CONFIG:
+ rewind_copy_file_range(entry->path, 0, entry->newsize, true,
+ true);
+ break;
+
+ case FILE_ACTION_COPY_DATA:
+ rewind_copy_file_range(entry->path, 0, entry->newsize, true,
+ false);
break;
case FILE_ACTION_TRUNCATE:
@@ -231,7 +240,7 @@ copy_executeFileMap(filemap_t *map)
case FILE_ACTION_COPY_TAIL:
rewind_copy_file_range(entry->path, entry->oldsize,
- entry->newsize, false);
+ entry->newsize, false, false);
break;
case FILE_ACTION_CREATE:
@@ -258,7 +267,7 @@ execute_pagemap(datapagemap_t *pagemap, const char *path)
while (datapagemap_next(iter, &blkno))
{
offset = blkno * BLCKSZ;
- rewind_copy_file_range(path, offset, offset + BLCKSZ, false);
+ rewind_copy_file_range(path, offset, offset + BLCKSZ, false, false);
/* Ok, this block has now been copied from new data dir to old */
}
pg_free(iter);
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 705383d184..290a4a1c5b 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -37,19 +37,21 @@ static void remove_target_symlink(const char *path);
/*
* Open a target file for writing. If 'trunc' is true and the file already
- * exists, it will be truncated.
+ * exists, it will be truncated. 'elevel' can be used to mark as non-fatal
+ * errors happening on non-readable files. Returns true if the file has been
+ * successfully opened, and false otherwise.
*/
-void
-open_target_file(const char *path, bool trunc)
+bool
+open_target_file(const char *path, bool trunc, int elevel)
{
int mode;
if (dry_run)
- return;
+ return true;
if (dstfd != -1 && !trunc &&
strcmp(path, &dstpath[strlen(datadir_target) + 1]) == 0)
- return; /* already open */
+ return true; /* already open */
close_target_file();
@@ -59,9 +61,22 @@ open_target_file(const char *path, bool trunc)
if (trunc)
mode |= O_TRUNC;
dstfd = open(dstpath, mode, 0600);
+
+ if (errno == EACCES)
+ {
+ /* Ignore errors trying to open unreadable files */
+ pg_log(elevel, "could not open target file \"%s\": %s\n",
+ dstpath, strerror(errno));
+
+ /* Reset stack for successive callers */
+ errno = 0;
+ return false;
+ }
+
if (dstfd < 0)
pg_fatal("could not open target file \"%s\": %s\n",
dstpath, strerror(errno));
+ return true;
}
/*
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index be580ee4db..17d0952e5f 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -12,7 +12,8 @@
#include "filemap.h"
-extern void open_target_file(const char *path, bool trunc);
+extern bool open_target_file(const char *path, bool trunc, int elevel);
+extern bool is_target_file(const char *path);
extern void write_target_range(char *buf, off_t begin, size_t size);
extern void close_target_file(void);
extern void truncate_target_file(const char *path, off_t newsize);
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 19da1ee7e0..fe273864b2 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -176,7 +176,8 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
}
else
{
- action = FILE_ACTION_COPY;
+ action = isRelDataFile(path) ? FILE_ACTION_COPY_DATA :
+ FILE_ACTION_COPY_CONFIG;
oldsize = 0;
}
}
@@ -392,7 +393,8 @@ process_block_change(ForkNumber forknum, RelFileNode rnode, BlockNumber blkno)
datapagemap_add(&entry->pagemap, blkno_inseg);
break;
- case FILE_ACTION_COPY:
+ case FILE_ACTION_COPY_DATA:
+ case FILE_ACTION_COPY_CONFIG:
case FILE_ACTION_REMOVE:
break;
@@ -456,8 +458,10 @@ action_to_str(file_action_t action)
{
case FILE_ACTION_NONE:
return "NONE";
- case FILE_ACTION_COPY:
- return "COPY";
+ case FILE_ACTION_COPY_DATA:
+ return "COPY_DATA";
+ case FILE_ACTION_COPY_CONFIG:
+ return "COPY_CONFIG";
case FILE_ACTION_TRUNCATE:
return "TRUNCATE";
case FILE_ACTION_COPY_TAIL:
@@ -494,7 +498,8 @@ calculate_totals(void)
map->total_size += entry->newsize;
- if (entry->action == FILE_ACTION_COPY)
+ if (entry->action == FILE_ACTION_COPY_CONFIG ||
+ entry->action == FILE_ACTION_COPY_DATA)
{
map->fetch_size += entry->newsize;
continue;
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 63ba32a6cd..28e48762ef 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -24,7 +24,10 @@
typedef enum
{
FILE_ACTION_CREATE, /* create local directory or symbolic link */
- FILE_ACTION_COPY, /* copy whole file, overwriting if exists */
+ FILE_ACTION_COPY_CONFIG, /* copy whole configuration file, overwriting
+ * if exists */
+ FILE_ACTION_COPY_DATA, /* copy whole relation file, overwriting if
+ * exists */
FILE_ACTION_COPY_TAIL, /* copy tail from 'oldsize' to 'newsize' */
FILE_ACTION_NONE, /* no action (we might still copy modified
* blocks based on the parsed WAL) */
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index d1726d1c74..c015603a93 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -326,7 +326,11 @@ receiveFileChunks(const char *sql)
pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %s, size %d\n",
filename, chunkoff_str, chunksize);
- open_target_file(filename, false);
+ /*
+ * File chunks are present for data files, so consider any failures
+ * as fatal.
+ */
+ open_target_file(filename, false, PG_FATAL);
write_target_range(chunk, chunkoff, chunksize);
@@ -452,9 +456,27 @@ libpq_executeFileMap(filemap_t *map)
/* nothing else to do */
break;
- case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
+ case FILE_ACTION_COPY_CONFIG:
+ /*
+ * Truncate the old file out of the way, if any. Any
+ * failure in opening the file is not critical here. We
+ * are dealing with a configuration files here, so
+ * ignore any failures if the file is unreadable in the
+ * target.
+ */
+ if (!open_target_file(entry->path, true, PG_WARNING))
+ break;
+ fetch_file_range(entry->path, 0, entry->newsize);
+ break;
+
+ case FILE_ACTION_COPY_DATA:
+ /*
+ * Truncate the old file out of the way, if any. We
+ * are dealing with a relation file which normally
+ * exists on the source but not on the target, so
+ * consider any failures as fatal.
+ */
+ open_target_file(entry->path, true, PG_FATAL);
fetch_file_range(entry->path, 0, entry->newsize);
break;
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 72ab2f8d5e..0997062d86 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -597,7 +597,8 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
pg_fatal("backup label buffer too small\n"); /* shouldn't happen */
/* TODO: move old file out of the way, if any. */
- open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
+ open_target_file("backup_label", true, PG_FATAL);
+
write_target_range(buf, 0, len);
close_target_file();
}
@@ -675,7 +676,7 @@ updateControlFile(ControlFileData *ControlFile)
memset(buffer, 0, PG_CONTROL_FILE_SIZE);
memcpy(buffer, ControlFile, sizeof(ControlFileData));
- open_target_file("global/pg_control", false);
+ open_target_file("global/pg_control", false, PG_FATAL);
write_target_range(buffer, 0, PG_CONTROL_FILE_SIZE);
diff --git a/src/bin/pg_rewind/t/006_readonly.pl b/src/bin/pg_rewind/t/006_readonly.pl
new file mode 100644
index 0000000000..5807110dad
--- /dev/null
+++ b/src/bin/pg_rewind/t/006_readonly.pl
@@ -0,0 +1,46 @@
+# Test how pg_rewind reacts to read-only files in the data dirs.
+# All such files should be ignored in the process.
+
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 2;
+
+use File::Find;
+
+use RewindTest;
+
+
+sub run_test
+{
+ my $test_mode = shift;
+
+ RewindTest::setup_cluster($test_mode);
+ RewindTest::start_master();
+ RewindTest::create_standby($test_mode);
+
+ # Create the same read-only file in standby and master
+ my $test_master_datadir = $node_master->data_dir;
+ my $test_standby_datadir = $node_standby->data_dir;
+ my $first_readonly_master = "$test_master_datadir/aaa_readonly_file";
+ my $first_readonly_standby = "$test_standby_datadir/aaa_readonly_file";
+ my $last_readonly_master = "$test_master_datadir/zzz_readonly_file";
+ my $last_readonly_standby = "$test_standby_datadir/zzz_readonly_file";
+
+ append_to_file($first_readonly_master, "in master");
+ append_to_file($first_readonly_standby, "in standby");
+ append_to_file($last_readonly_master, "in master");
+ append_to_file($last_readonly_standby, "in standby");
+ #chmod 0400, $first_readonly_master, $first_readonly_standby;
+ chmod 0400, $last_readonly_master, $last_readonly_standby;
+
+ RewindTest::promote_standby();
+ RewindTest::run_pg_rewind($test_mode);
+ RewindTest::clean_rewind_test();
+}
+
+# Run the test in both modes.
+run_test('local');
+run_test('remote');
+
+exit(0);
1. I can confirm that your patch is effective also in my Docker-based test setup and with the current REL_10_STABLE code base (i.e. PostgreSQL 10.2).
2. Your patch is more encompassing than the one I had submitted earlier, and it may be the right way to go. It is cleaner but more "complicated" in that it may require enlisting/recognizing all those special files (pg_control, filenode.map, etc). IMO the earlier patch would already tolerate/handle those, because the distinction it makes is not based on whether something is a configuration file, but purely on whether it is writable.
3. Sorry for the late response :)
On Jan 15, 2018, at 8:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:
On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.
I can already confirm that this also occurs with PostgreSQL 9.6.
As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. For a relation file, it should be. There
are still two things I am not completely happy about:
- pg_control is considered a configuration file with this patch,
however we should fail if pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider adding special handling for
things like pg_control, filenode.map, etc. so that you get a hard failure
if those are not readable; they should enter the category of
FILE_ACTION_COPY_DATA. This needs a bit more thought in
process_source_file().
- open_target_file() manually resets errno. This is necessary as it
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
with this solution.
I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.
Thoughts?
--
Michael
<rewind-readonly-fix-v1.patch>
On Mon, Feb 26, 2018 at 05:28:53PM +0100, TipTop Labs wrote:
1. I can confirm that your patch is effective also in my Docker-based
test setup and with the current REL_10_STABLE code base
(i.e. PostgreSQL 10.2).
Thanks for checking. Note that I am still not completely happy with the
handling of errno in some newly-added code paths...
2. Your patch is more encompassing than the one I had submitted
earlier, and it may be the right way to go. It is cleaner but more
"complicated" in that it may require enlisting/recognizing all those
special files (pg_control, filenode.map, etc). IMO the earlier patch
would already/tolerate handle those, because the distinction it makes
is not based on whether something is a configuration file, but purely
on whether it is writable.
You are basically looking for that I think:
https://postgr.es/m/20180205071022.GA17337@paquier.xyz
You cannot ignore pg_control and filenode.map though as those are
critical data so they have to be updated. So if those files are not
writable, you actually have more problems than you think as the cluster
would not be able to correctly start.
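For illustration, a rough sketch of what such a hard-failure list could look
like (the helper is hypothetical and not part of any attached patch; the
on-disk relation maps are actually named pg_filenode.map, one in global/ and
one per database directory):

#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, sketch only: target files critical for cluster
 * startup, which must fail hard if they cannot be written on the target.
 */
static bool
is_critical_target_file(const char *path)
{
	return strcmp(path, "global/pg_control") == 0 ||
		strcmp(path, "backup_label") == 0 ||
		strstr(path, "pg_filenode.map") != NULL;
}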
--
Michael
On 27 February 2018 at 02:24, Michael Paquier <michael@paquier.xyz> wrote:
Note that I am still not completely happy with the
handling of errno in some newly-added code paths...
It may be a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCES` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:
For some system calls and library functions (e.g., getpriority(2)), -1 is
a valid return on success. In such cases, a successful return can be
distinguished from an error return by setting errno to zero before the
call, and then, if the call returns a status that indicates that an error
may have occurred, checking to see if errno has a nonzero value.
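For reference, that man-page idiom in a small self-contained C sketch
(illustrative only, not tied to the pg_rewind code):

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * getpriority() can return -1 both as a valid nice value and on error,
 * so errno is cleared before the call and checked afterwards to tell
 * the two cases apart.
 */
static int
get_nice_checked(id_t who)
{
	int		prio;

	errno = 0;						/* reset before the call */
	prio = getpriority(PRIO_PROCESS, who);
	if (prio == -1 && errno != 0)	/* a real error, not a -1 priority */
		perror("getpriority");
	return prio;
}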
On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:
It may be a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCES` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:
[ ... reads and feels stupid ...]
Of course all the checks should be where dstfd is negative...
I have done a second pass on the patch, and attached is a new version.
This fixes this handling of errno and addresses some typos. I have also
fixed the test case where one of the read-only files was not properly
switched to do so. I have also added a commit log message.
--
Michael
Attachments:
0001-Improve-error-handling-of-configuration-files-in-pg_.patch (text/x-diff; charset=us-ascii)
From 194d38a967263171ad0d11f98dcad9bf21f50f21 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 5 Mar 2018 16:44:13 +0900
Subject: [PATCH] Improve error handling of configuration files in pg_rewind
In a couple of Linux distributions like Debian and Ubuntu, the data
folder can include links to files which are read-only for the user
managing the PostgreSQL instance. When using pg_rewind, this can lead to
unwanted corruption of data folders as a failure in writing a file
causes a follow-up write to the control file to fail, leaving it
truncated.
Relation files which are unreadable should still result in a failure, as
those are critical for the state of the cluster. However, make the
handling of other files smoother by filtering out of the rewind process
files which are present on both the source and target servers but are not
readable on the target, as such configuration files are likely to match
between the source and the target.
Bug report from Christian H.
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
---
src/bin/pg_rewind/copy_fetch.c | 21 ++++++++++++-----
src/bin/pg_rewind/file_ops.c | 32 ++++++++++++++++++++++----
src/bin/pg_rewind/file_ops.h | 3 ++-
src/bin/pg_rewind/filemap.c | 15 ++++++++----
src/bin/pg_rewind/filemap.h | 5 +++-
src/bin/pg_rewind/libpq_fetch.c | 30 ++++++++++++++++++++----
src/bin/pg_rewind/pg_rewind.c | 5 ++--
src/bin/pg_rewind/t/006_readonly.pl | 46 +++++++++++++++++++++++++++++++++++++
8 files changed, 133 insertions(+), 24 deletions(-)
create mode 100644 src/bin/pg_rewind/t/006_readonly.pl
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 04db409675..2696972421 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -154,9 +154,11 @@ recurse_dir(const char *datadir, const char *parentpath,
* Copy a file from source to target, between 'begin' and 'end' offsets.
*
* If 'trunc' is true, any existing file with the same name is truncated.
+ * If 'error_ok' is true, any failure when opening the file is ignored.
*/
static void
-rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
+rewind_copy_file_range(const char *path, off_t begin, off_t end,
+ bool trunc, bool error_ok)
{
char buf[BLCKSZ];
char srcpath[MAXPGPATH];
@@ -172,7 +174,8 @@ rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
if (lseek(srcfd, begin, SEEK_SET) == -1)
pg_fatal("could not seek in source file: %s\n", strerror(errno));
- open_target_file(path, trunc);
+ if (!open_target_file(path, trunc, error_ok ? PG_WARNING : PG_FATAL))
+ return;
while (end - begin > 0)
{
@@ -221,8 +224,14 @@ copy_executeFileMap(filemap_t *map)
/* ok, do nothing.. */
break;
- case FILE_ACTION_COPY:
- rewind_copy_file_range(entry->path, 0, entry->newsize, true);
+ case FILE_ACTION_COPY_CONFIG:
+ rewind_copy_file_range(entry->path, 0, entry->newsize, true,
+ true);
+ break;
+
+ case FILE_ACTION_COPY_DATA:
+ rewind_copy_file_range(entry->path, 0, entry->newsize, true,
+ false);
break;
case FILE_ACTION_TRUNCATE:
@@ -231,7 +240,7 @@ copy_executeFileMap(filemap_t *map)
case FILE_ACTION_COPY_TAIL:
rewind_copy_file_range(entry->path, entry->oldsize,
- entry->newsize, false);
+ entry->newsize, false, false);
break;
case FILE_ACTION_CREATE:
@@ -258,7 +267,7 @@ execute_pagemap(datapagemap_t *pagemap, const char *path)
while (datapagemap_next(iter, &blkno))
{
offset = blkno * BLCKSZ;
- rewind_copy_file_range(path, offset, offset + BLCKSZ, false);
+ rewind_copy_file_range(path, offset, offset + BLCKSZ, false, false);
/* Ok, this block has now been copied from new data dir to old */
}
pg_free(iter);
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 705383d184..e293c9b86d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -37,19 +37,21 @@ static void remove_target_symlink(const char *path);
/*
* Open a target file for writing. If 'trunc' is true and the file already
- * exists, it will be truncated.
+ * exists, it will be truncated. 'elevel' can be used to mark as non-fatal
+ * errors happening on non-readable files. Returns true if the file has been
+ * successfully opened, and false otherwise.
*/
-void
-open_target_file(const char *path, bool trunc)
+bool
+open_target_file(const char *path, bool trunc, int elevel)
{
int mode;
if (dry_run)
- return;
+ return true;
if (dstfd != -1 && !trunc &&
strcmp(path, &dstpath[strlen(datadir_target) + 1]) == 0)
- return; /* already open */
+ return true; /* already open */
close_target_file();
@@ -59,9 +61,29 @@ open_target_file(const char *path, bool trunc)
if (trunc)
mode |= O_TRUNC;
dstfd = open(dstpath, mode, 0600);
+
if (dstfd < 0)
+ {
+ if (errno == EACCES)
+ {
+ /*
+ * Ignore errors trying to open unreadable files if the caller
+ * asks to do so.
+ */
+ pg_log(elevel, "could not open target file \"%s\": %s\n",
+ dstpath, strerror(errno));
+
+ /* Reset fd for successive callers */
+ dstfd = -1;
+ return false;
+ }
+
+ /* Fallback to fatal error in other cases */
pg_fatal("could not open target file \"%s\": %s\n",
dstpath, strerror(errno));
+ }
+
+ return true;
}
/*
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index be580ee4db..17d0952e5f 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -12,7 +12,8 @@
#include "filemap.h"
-extern void open_target_file(const char *path, bool trunc);
+extern bool open_target_file(const char *path, bool trunc, int elevel);
+extern bool is_target_file(const char *path);
extern void write_target_range(char *buf, off_t begin, size_t size);
extern void close_target_file(void);
extern void truncate_target_file(const char *path, off_t newsize);
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 19da1ee7e0..fe273864b2 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -176,7 +176,8 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
}
else
{
- action = FILE_ACTION_COPY;
+ action = isRelDataFile(path) ? FILE_ACTION_COPY_DATA :
+ FILE_ACTION_COPY_CONFIG;
oldsize = 0;
}
}
@@ -392,7 +393,8 @@ process_block_change(ForkNumber forknum, RelFileNode rnode, BlockNumber blkno)
datapagemap_add(&entry->pagemap, blkno_inseg);
break;
- case FILE_ACTION_COPY:
+ case FILE_ACTION_COPY_DATA:
+ case FILE_ACTION_COPY_CONFIG:
case FILE_ACTION_REMOVE:
break;
@@ -456,8 +458,10 @@ action_to_str(file_action_t action)
{
case FILE_ACTION_NONE:
return "NONE";
- case FILE_ACTION_COPY:
- return "COPY";
+ case FILE_ACTION_COPY_DATA:
+ return "COPY_DATA";
+ case FILE_ACTION_COPY_CONFIG:
+ return "COPY_CONFIG";
case FILE_ACTION_TRUNCATE:
return "TRUNCATE";
case FILE_ACTION_COPY_TAIL:
@@ -494,7 +498,8 @@ calculate_totals(void)
map->total_size += entry->newsize;
- if (entry->action == FILE_ACTION_COPY)
+ if (entry->action == FILE_ACTION_COPY_CONFIG ||
+ entry->action == FILE_ACTION_COPY_DATA)
{
map->fetch_size += entry->newsize;
continue;
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 63ba32a6cd..28e48762ef 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -24,7 +24,10 @@
typedef enum
{
FILE_ACTION_CREATE, /* create local directory or symbolic link */
- FILE_ACTION_COPY, /* copy whole file, overwriting if exists */
+ FILE_ACTION_COPY_CONFIG, /* copy whole configuration file, overwriting
+ * if exists */
+ FILE_ACTION_COPY_DATA, /* copy whole relation file, overwriting if
+ * exists */
FILE_ACTION_COPY_TAIL, /* copy tail from 'oldsize' to 'newsize' */
FILE_ACTION_NONE, /* no action (we might still copy modified
* blocks based on the parsed WAL) */
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 8f8d504455..bef5736256 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -333,7 +333,11 @@ receiveFileChunks(const char *sql)
pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %s, size %d\n",
filename, chunkoff_str, chunksize);
- open_target_file(filename, false);
+ /*
+ * File chunks are present for data files, so consider any failures
+ * as fatal.
+ */
+ open_target_file(filename, false, PG_FATAL);
write_target_range(chunk, chunkoff, chunksize);
@@ -459,9 +463,27 @@ libpq_executeFileMap(filemap_t *map)
/* nothing else to do */
break;
- case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
+ case FILE_ACTION_COPY_CONFIG:
+ /*
+ * Truncate the old file out of the way, if any. Any
+ * failure in opening the file is not critical here. We
+ * are dealing with a configuration file here, so
+ * ignore any failures if the file is unreadable in the
+ * target.
+ */
+ if (!open_target_file(entry->path, true, PG_WARNING))
+ break;
+ fetch_file_range(entry->path, 0, entry->newsize);
+ break;
+
+ case FILE_ACTION_COPY_DATA:
+ /*
+ * Truncate the old file out of the way, if any. We
+ * are dealing with a relation file which normally
+ * exists on the source but not on the target, so
+ * consider any failures as fatal.
+ */
+ open_target_file(entry->path, true, PG_FATAL);
fetch_file_range(entry->path, 0, entry->newsize);
break;
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 72ab2f8d5e..0997062d86 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -597,7 +597,8 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
pg_fatal("backup label buffer too small\n"); /* shouldn't happen */
/* TODO: move old file out of the way, if any. */
- open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
+ open_target_file("backup_label", true, PG_FATAL);
+
write_target_range(buf, 0, len);
close_target_file();
}
@@ -675,7 +676,7 @@ updateControlFile(ControlFileData *ControlFile)
memset(buffer, 0, PG_CONTROL_FILE_SIZE);
memcpy(buffer, ControlFile, sizeof(ControlFileData));
- open_target_file("global/pg_control", false);
+ open_target_file("global/pg_control", false, PG_FATAL);
write_target_range(buffer, 0, PG_CONTROL_FILE_SIZE);
diff --git a/src/bin/pg_rewind/t/006_readonly.pl b/src/bin/pg_rewind/t/006_readonly.pl
new file mode 100644
index 0000000000..7c0af7baf4
--- /dev/null
+++ b/src/bin/pg_rewind/t/006_readonly.pl
@@ -0,0 +1,46 @@
+# Test how pg_rewind reacts to read-only files in the data dirs.
+# All such files should be ignored in the process.
+
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 2;
+
+use File::Find;
+
+use RewindTest;
+
+
+sub run_test
+{
+ my $test_mode = shift;
+
+ RewindTest::setup_cluster($test_mode);
+ RewindTest::start_master();
+ RewindTest::create_standby($test_mode);
+
+ # Create the same read-only file in standby and master
+ my $test_master_datadir = $node_master->data_dir;
+ my $test_standby_datadir = $node_standby->data_dir;
+ my $first_readonly_master = "$test_master_datadir/aaa_readonly_file";
+ my $first_readonly_standby = "$test_standby_datadir/aaa_readonly_file";
+ my $last_readonly_master = "$test_master_datadir/zzz_readonly_file";
+ my $last_readonly_standby = "$test_standby_datadir/zzz_readonly_file";
+
+ append_to_file($first_readonly_master, "in master");
+ append_to_file($first_readonly_standby, "in standby");
+ append_to_file($last_readonly_master, "in master");
+ append_to_file($last_readonly_standby, "in standby");
+ chmod 0400, $first_readonly_master, $first_readonly_standby;
+ chmod 0400, $last_readonly_master, $last_readonly_standby;
+
+ RewindTest::promote_standby();
+ RewindTest::run_pg_rewind($test_mode);
+ RewindTest::clean_rewind_test();
+}
+
+# Run the test in both modes.
+run_test('local');
+run_test('remote');
+
+exit(0);
--
2.16.2
On Mon, Mar 5, 2018 at 4:54 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:
It may be a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCES` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:
[ ... reads and feels stupid ...]
Of course all the checks should be where dstfd is negative...
I have done a second pass on the patch, and attached is a new version.
This fixes this handling of errno and addresses some typos. I have also
fixed the test case where one of the read-only files was not properly
switched to do so. I have also added a commit log message.
@@ -24,7 +24,10 @@
typedef enum
{
FILE_ACTION_CREATE, /* create local directory or symbolic link */
- FILE_ACTION_COPY, /* copy whole file, overwriting if exists */
+ FILE_ACTION_COPY_CONFIG, /* copy whole configuration file, overwriting
+ * if exists */
+ FILE_ACTION_COPY_DATA, /* copy whole relation file, overwriting if
+ * exists */
FILE_ACTION_COPY_TAIL, /* copy tail from 'oldsize' to 'newsize' */
FILE_ACTION_NONE, /* no action (we might still copy modified
* blocks based on the parsed WAL) */
Since files other than relation files, such as vm, fsm and WAL, are
categorized as FILE_ACTION_COPY_CONFIG, I think the name could cause
misreading. For example, we could use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER instead?
As you mentioned before, with this patch we end up ignoring file-open
errors for database files other than relation files. I guess it's a bit
risky. We could put them in COPY_DATA, but I'd defer that to the committer.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
Since files other than relation files, such as vm, fsm and WAL, are
categorized as FILE_ACTION_COPY_CONFIG, I think the name could cause
misreading. For example, we could use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER instead?
I am not set on a given name.
As you mentioned before, with this patch we end up ignoring file-open
errors for database files other than relation files. I guess it's a bit
risky. We could put them in COPY_DATA, but I'd defer that to the committer.
COPY_DATA is aimed at being used on files which can be parsed by
blocks. You cannot do that with VMs and FSMs. Now you are right that we
could complain for non-configuration files which need to be
fully-copied, still Postgres issues a fsync on all the files of the data
folder when beginning recovery, so you would bump on problems anyway
after restarting the rewound instance.
--
Michael
On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
Since files other than relation files, such as vm, fsm and WAL, are
categorized as FILE_ACTION_COPY_CONFIG, I think the name could cause
misreading. For example, we could use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER instead?
I am not set on a given name.
Hmm, when I used pg_rewind --debug with this patch the debug output
confused me because almost all database files appear with
COPY_CONFIG. Also, looking at the source code comment, it says
COPY_CONFIG is aimed at configuration files. But these files are not
actually configuration files. COPY_DATA makes sense to me, but
COPY_CONFIG doesn't.
Anyway, other than that the patch looks good to me. I'd like to wait
for Dmitry's review comment before marking this as "Ready for
Committer".
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 6 March 2018 at 02:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
Since files other than relation files, such as vm, fsm and WAL, are
categorized as FILE_ACTION_COPY_CONFIG, I think the name could cause
misreading. For example, we could use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER instead?
I am not set on a given name.
Hmm, when I used pg_rewind --debug with this patch the debug output
confused me because almost all database files appear with
COPY_CONFIG. Also, looking at the source code comment, it says
COPY_CONFIG is aimed at configuration files. But these files are not
actually configuration files. COPY_DATA makes sense to me, but
COPY_CONFIG doesn't.
Anyway, other than that the patch looks good to me. I'd like to wait
for Dmitry's review comment before marking this as "Ready for
Committer".
Thank you for waiting. Yes, it also looks good to me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCES`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like too many
symbolic links encountered in resolving a pathname (`ELOOP`)?
On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:
Thank you for waiting. Yes, it also looks good to me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCES`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like too many
symbolic links encountered in resolving a pathname (`ELOOP`)?
The presence of the file is ensured in the pre-phase which builds the
file map (see process_source_file), and actions are taken depending on
the presence of a file on the source and the target. So a file missing
on the target after those pre-checks have ensured that it was actually
existing should be reported with ENOENT. ELOOP would as well be faced
on the backend before seeing it in pg_rewind, no? In short, it seems to
me that it is better to keep the code simple.
--
Michael
On 7 March 2018 at 02:46, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:
Thank you for waiting. Yes, it also looks good to me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCES`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like too many
symbolic links encountered in resolving a pathname (`ELOOP`)?
The presence of the file is ensured in the pre-phase which builds the
file map (see process_source_file), and actions are taken depending on
the presence of a file on the source and the target. So a file missing
on the target after those pre-checks have ensured that it was actually
existing should be reported with ENOENT. ELOOP would as well be faced
on the backend before seeing it in pg_rewind, no? In short, it seems to
me that it is better to keep the code simple.
Ok, I agree. Then yes, this patch can be probably marked as ready.
Dmitry Dolgov <9erthalion6@gmail.com> writes:
Ok, I agree. Then yes, this patch can be probably marked as ready.
I started to look over this patch, and soon decided that the
commentary in pg_rewind is impossibly bad :-(. You need to study
libpq_executeFileMap for a long time before you are even sure which side
of the transfer it's executing on; the fact that it's doing a copy *TO*
the remote server is incredibly misleading, and the comments are not
anywhere near good enough to de-confuse somebody coming upon the code for
the first time. Not to mention that the function's name conveys nothing
whatever. Maybe it's not the job of this patch to fix that, but I can't
help thinking that poor design and documentation are a big part of why
this bug exists in the first place.
But enough of that rant. Now that I've more or less wrapped my head
around what's happening, I think this is the core of the problem:
libpq_executeFileMap truncates every non-data file in the local data
directory before it's fetched anything at all from the remote server.
This seems to me to be utterly cavalier and brittle, because ANY FAILURE
AT ALL in the fetch sequence leaves you with a data directory that's been
trashed in whole or in part --- and that sequence could span a good long
time, if there's a lot to copy. The originally complained-of problem
(that pg_control gets nuked) is just the tip of that iceberg, and I'm
afraid that the proposed patch is only removing a single potential
failure mode.
I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://postgr.es/m/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.
Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.
regards, tom lane
On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:
I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://postgr.es/m/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.
Please note that Christian's patch awkwardly makes the difference
between a configuration file and a relation file by abusing chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.
Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.
I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, and I also
agree that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files when a first chunk is received, only if the file is
a configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.
In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.
At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!
So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.
Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.
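For what it's worth, a rough sketch of that simplification inside the chunk
loop of receiveFileChunks(), reusing the three-argument open_target_file()
from the earlier patch; the exact error handling here is an assumption, not
an attached patch:

/*
 * Sketch only.  Assumes chunks are fetched with ORDER BY path, begin,
 * so offset 0 is always the first chunk seen for a file copied in full.
 */
bool	trunc = (chunkoff == 0);	/* first chunk => truncate the file */
int		elevel = isRelDataFile(filename) ? PG_FATAL : PG_WARNING;

if (!open_target_file(filename, trunc, elevel))
	continue;		/* skip chunks of unreadable non-relation files */
write_target_range(chunk, chunkoff, chunksize);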
The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.
Thoughts?
--
Michael
On Apr 3, 2018, at 4:41 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:
I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://postgr.es/m/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.
Please note that Christian's patch awkwardly makes the difference
between a configuration file and a relation file by abusing chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.
IMO this is a better characterization of my original patch (from https://github.com/tiptop-labs/postgres-patches/blob/master/bug_14999/README):
bug
===
libpq_fetch.c loops twice over files in pgdata, a first time in
libpq_executeFileMap(), and then a second time (for files with action
FILE_ACTION_COPY or FILE_ACTION_COPY_TAIL) in receiveFileChunks().
For FILE_ACTION_COPY, it deletes (truncates) the file from the target directory
already during the first loop. If pg_rewind then encounters a read-only file
(e.g. server.key) still during the first loop, it exits with pg_fatal
("could not open target file"). After such truncation of global/pg_control
pg_rewind cannot run again ("unexpected control file size 0, expected 8192")
and a restore from pg_basebackup is needed.
patch
=====
Truncation of the file was moved to the second loop. Truncation occurs there if
chunks are written into files at offset 0. This is the case for
FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these
chunks are processed first.
Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files when a first chunk is received, only if the file is
a configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.
In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.
At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!
So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.
Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.
The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.Thoughts?
--
Michael
Regards,
Christian
On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote:
Truncation of the file was moved to the second loop. Truncation occurs
there if chunks are written into files at offset 0. This is the case
for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures
that these chunks are processed first.
Yes. I have spent a large portion of my day looking at that. And
actually this is wrong as well. First, please find attached a patch I
have written while testing your changes. There is nothing much new in
it, except that I added more comments and a test (other tests fail as
well, that does not matter).
First, I have looked at a way to not rely on the tweak with chunkoff by
extending the temporary table used to register the chunks with an extra
column which tracks the type of action which is taken on the file.
Another one I looked at was to pass down the action type taken on the
entry directly to receiveFileChunks(). However both of them turned out
to be grotty.
So after that I fell back to your patch and began testing it, which is
where I noticed that we can *never* guarantee the recovery of a data
folder on which an error has happened in the middle of a pg_rewind. The
reason for that is quite simple: even if the truncation has been moved
down to the moment where the first chunk of a file is received, you may
have already done work on some relation files. Particularly, some of
them may have been truncated down to a given size without a new range of
blocks fetched from the source. So the data folder would be in an
inconsistent state if trying to rewind it again.
If the control file from the source has been received and then a
read-only error is found, then in order to perform a subsequent rewind
you need to start and stop the target cluster cleanly, or update
manually the control file and mark it as cleanly stopped. And this is a
recipe for data corruption as we now start a rewind based on the
guarantee that a target cluster has been able to cleanly stop, with a
shutdown checkpoint issued. You could always try to start and stop the
cluster cleanly, but if your target instance replays WAL on data blocks
which have been truncated by a previous rewind, you would need to
re-create an instance using a new base backup, which is actually harder
to diagnose.
The patch I sent previously which makes the difference between relation
files and "configuration" files helps a bit, still it makes me uneasy
because it will not be able to handle correctly failures on files which
are part of a relation but need to be fully copied. So I think that we
should drop this approach as well for its complexity.
Another thing that could definitely help is a patch which checks that
*all* files present on both the source and the target are writable
before doing any operation, and makes the rewind fail nicely with the
target still intact. That's quite a bit of refactoring ahead for, I
think, little benefit in the long run.
What we could simply do is document the limitation. Hence, if both the
target and the source have read-only files, then the user needs to be
sure to remove them from the target before doing the rewind, and to
re-create the links after the operation. If an error is found while
processing the rewind, then a new base backup needs to be taken.
The refactoring I mention two paragraphs above could definitely be done
to ease the user's life, but I would treat that as a new feature and
not a bug fix. Another, much simpler approach we could take is to scan
the data folder of the target before doing anything, check whether some
files are not writable, and report this list back to the user. If we
did that even for files which are on the target but not the source then
that would be a loss for some cases; however, I would like to imagine
that when using pg_rewind the configuration file set is consistent
across nodes in the majority of cases, so a user could remove those
non-writable files manually. This also has the advantage of being
non-intrusive, so a back-patch is definitely possible.
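A minimal sketch of that pre-scan, assuming pg_rewind's filemap
structures and helpers (the names follow the current sources, but this
is an illustration rather than a tested patch):

/*
 * Hypothetical pre-check: walk the file map before writing anything
 * and report every copy target which is not writable, so that the
 * user gets the full list in a single run.  Assumes pg_rewind's
 * usual headers for access(), errno and MAXPGPATH.
 */
static void
check_targets_writable(filemap_t *map)
{
    char        localpath[MAXPGPATH];
    bool        failed = false;
    int         i;

    for (i = 0; i < map->narray; i++)
    {
        file_entry_t *entry = map->array[i];

        if (entry->action != FILE_ACTION_COPY)
            continue;

        snprintf(localpath, sizeof(localpath), "%s/%s",
                 datadir_target, entry->path);
        if (access(localpath, W_OK) != 0 && errno == EACCES)
        {
            pg_log(PG_WARNING, "file \"%s\" is not writable on target\n",
                   entry->path);
            failed = true;
        }
    }

    if (failed)
        pg_fatal("unwritable files found in target data folder\n");
}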
Thoughts?
--
Michael
Attachments:
rewind-readonly-v3.patch (text/x-diff; charset=us-ascii)
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 5914b15017..821e4b0c81 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -337,7 +337,18 @@ receiveFileChunks(const char *sql)
pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %s, size %d\n",
filename, chunkoff_str, chunksize);
- open_target_file(filename, false);
+ /*
+ * Truncate new files which are fully copied. We are sure not to
+ * truncate a file whose data has been partially fetched, for two
+ * reasons:
+ * 1) The set of file ranges is ordered so that the chunks of each
+ * file are processed sequentially.
+ * 2) Only files fully copied register a chunk with an offset of
+ * 0, which means that the beginning of the file is received, so it
+ * should be truncated first. Note that the truncation is lossy,
+ * and may be attempted on a file which does not exist locally.
+ */
+ open_target_file(filename, chunkoff == 0);
write_target_range(chunk, chunkoff, chunksize);
@@ -464,8 +475,6 @@ libpq_executeFileMap(filemap_t *map)
break;
case FILE_ACTION_COPY:
- /* Truncate the old file out of the way, if any */
- open_target_file(entry->path, true);
fetch_file_range(entry->path, 0, entry->newsize);
break;
@@ -501,12 +510,17 @@ libpq_executeFileMap(filemap_t *map)
/*
* We've now copied the list of file ranges that we need to fetch to the
- * temporary table. Now, actually fetch all of those ranges.
+ * temporary table. Now, actually fetch all of those ranges. The elements
+ * fetched are ordered so that the truncation of files fully copied from
+ * the source server happens when their first chunk is received. This
+ * also gives a better sequential performance to the operations,
+ * especially when working on large files.
*/
sql =
"SELECT path, begin,\n"
" pg_read_binary_file(path, begin, len, true) AS chunk\n"
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
receiveFileChunks(sql);
}
diff --git a/src/bin/pg_rewind/t/006_readonly.pl b/src/bin/pg_rewind/t/006_readonly.pl
new file mode 100644
index 0000000000..51db93a9a0
--- /dev/null
+++ b/src/bin/pg_rewind/t/006_readonly.pl
@@ -0,0 +1,85 @@
+# Test how pg_rewind reacts to read-only files in the data dirs.
+# pg_rewind should fail until such files are removed from both nodes.
+
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 2;
+
+use File::Copy;
+use File::Find;
+
+use RewindTest;
+
+my $test_mode = "remote";
+
+RewindTest::setup_cluster($test_mode);
+RewindTest::start_master();
+RewindTest::create_standby($test_mode);
+
+# Create the same read-only file in standby and master
+my $test_master_datadir = $node_master->data_dir;
+my $test_standby_datadir = $node_standby->data_dir;
+my $readonly_master = "$test_master_datadir/readonly_file";
+my $readonly_standby = "$test_standby_datadir/readonly_file";
+
+append_to_file($readonly_master, "in master");
+append_to_file($readonly_standby, "in standby");
+chmod 0400, $readonly_master, $readonly_standby;
+
+RewindTest::promote_standby();
+
+# Stop the master and run pg_rewind.
+$node_master->stop;
+
+my $master_pgdata = $node_master->data_dir;
+my $standby_pgdata = $node_standby->data_dir;
+my $standby_connstr = $node_standby->connstr('postgres');
+my $tmp_folder = TestLib::tempdir;
+
+# Keep a temporary postgresql.conf for master node or it would be
+# overwritten during the rewind.
+copy(
+ "$master_pgdata/postgresql.conf",
+ "$tmp_folder/master-postgresql.conf.tmp");
+
+# Including read-only data in the source and the target will
+# cause pg_rewind to fail.
+my $rewind_command = [ 'pg_rewind', "--debug",
+ "--source-server", $standby_connstr,
+ "--target-pgdata=$master_pgdata" ];
+
+command_fails($rewind_command, 'pg_rewind fails with read-only');
+
+# Need to put the server back into a clean state for next rewind.
+# Move back postgresql.conf with old settings to allow the node to
+# start and stop, allowing the follow-up rewind to work properly.
+move("$tmp_folder/master-postgresql.conf.tmp",
+ "$master_pgdata/postgresql.conf");
+$node_master->start;
+$node_master->stop;
+
+# Now remove the read-only data on both sides; the data folder
+# from the previous attempt should still be usable.
+unlink($readonly_master);
+unlink($readonly_standby);
+command_ok($rewind_command, 'pg_rewind passes without read-only');
+
+# Now move back postgresql.conf with old settings
+move("$tmp_folder/master-postgresql.conf.tmp",
+ "$master_pgdata/postgresql.conf");
+
+# Plug-in rewound node to the now-promoted standby node
+my $port_standby = $node_standby->port;
+$node_master->append_conf('recovery.conf', qq(
+primary_conninfo='port=$port_standby'
+standby_mode=on
+recovery_target_timeline='latest'
+));
+
+# Restart the master to check that rewind went correctly.
+$node_master->start;
+
+RewindTest::clean_rewind_test();
+
+exit(0);
Michael Paquier <michael@paquier.xyz> writes:
So after that I fell back to your patch and began testing it, which is
where I noticed that we can *never* guarantee that a data folder on
which an error has happened in the middle of a pg_rewind can be
recovered. The reason is quite simple: even if the truncation has been
moved down to the moment where the first chunk of a file is received,
you may have already done work on some relation files. In particular,
some of them may have been truncated down to a given size without a new
range of blocks having been fetched from the source. So the data folder
would be in an inconsistent state when trying to rewind it again.
Yes, we certainly cannot guarantee that failure partway through pg_rewind
leaves a consistent state of the target data directory. It is likely
worth pointing that out in the documentation. Whether we can or should
do anything about it is a different question.
When I first started looking at this thread, I wondered if maybe somebody
had had in mind to create an active defense against starting a postmaster
in an inconsistent target cluster, by dint of intentionally truncating
pg_control before the transfer starts and not making it valid again till
the very end. It's now clear from looking at the code that that's not
what's going on :-(. But I wonder how hard it would be to make it so,
and whether that'd be worth doing if it's not too hard.
Actually, probably a safer way to attack that would be to remove or
rename the topmost PG_VERSION file, and then put it back afterwards.
That'd be far easier to recover from manually, if need be, than
clobbering pg_control.
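For instance, as a purely hypothetical sketch reusing pg_rewind's
datadir_target and error reporting (nothing like this exists in the
tree):

/*
 * Hypothetical guard: move PG_VERSION aside before touching the
 * target so that a half-rewound data folder cannot be started, and
 * put it back only once the rewind has fully completed.
 */
char        versionpath[MAXPGPATH];
char        asidepath[MAXPGPATH];

snprintf(versionpath, sizeof(versionpath),
         "%s/PG_VERSION", datadir_target);
snprintf(asidepath, sizeof(asidepath),
         "%s/PG_VERSION.pg_rewind_tmp", datadir_target);

if (rename(versionpath, asidepath) != 0)
    pg_fatal("could not rename \"%s\" to \"%s\": %s\n",
             versionpath, asidepath, strerror(errno));

/* ... run the rewind itself ... */

if (rename(asidepath, versionpath) != 0)
    pg_fatal("could not restore \"%s\": %s\n",
             versionpath, strerror(errno));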
In any case, that seems separate from the question of what to do with
read-only files in the data directory. Should we push forward with
committing Michael's previous patch, and leave that issue for later?
regards, tom lane
On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
When I first started looking at this thread, I wondered if maybe somebody
had had in mind to create an active defense against starting a postmaster
in an inconsistent target cluster, by dint of intentionally truncating
pg_control before the transfer starts and not making it valid again till
the very end. It's now clear from looking at the code that that's not
what's going on :-(. But I wonder how hard it would be to make it so,
and whether that'd be worth doing if it's not too hard.
One safeguard I can see is to add an extra loop just before processing
the file map to check if a file planned for copy can be sanely opened or
not. That's simple to do with open_target_file(), followed by
close_target_file(). If there is an EACCES error, then we complain at
this early stage, and the data folder can be reused after removing those
files. That's not actually hard to do at quick glance, but I may be
missing something so let's be careful.
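In code form, the loop could be as simple as the following hedged
sketch on top of the existing helpers; unlike a scan which reports all
offenders at once, this stops at the first one, because
open_target_file() reports the EACCES failure itself through
pg_fatal():

/*
 * Hypothetical pre-flight loop over the file map: opening each copy
 * target for write makes open_target_file() fail early on EACCES,
 * before anything in the target data folder has been modified.
 */
int         i;

for (i = 0; i < map->narray; i++)
{
    file_entry_t *entry = map->array[i];

    if (entry->action == FILE_ACTION_COPY)
    {
        open_target_file(entry->path, false);
        close_target_file();
    }
}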
Another approach that we could consider, HEAD-only though, is to extend
pg_stat_file so that an entry's st_mode is available, and use that to
not copy the file's data from the source in the first place. That would
actually be quite clean, particularly for large read-only files.
Actually, probably a safer way to attack that would be to remove or
rename the topmost PG_VERSION file, and then put it back afterwards.
That'd be far easier to recover from manually, if need be, than
clobbering pg_control.

In any case, that seems separate from the question of what to do with
read-only files in the data directory. Should we push forward with
committing Michael's previous patch, and leave that issue for later?
That one is not good either as long as we do not distinguish between
three types of files in the data folder:
1) Relation files.
2) System files which are critical to various degrees for the system.
3) Custom configuration files.
pg_rewind is able to treat 1) as a separate category, as it usually
needs to fetch a set of block ranges, but it puts 2) and 3) in the same
bucket. It would be possible to split 2) and 3), but the maintenance
cost is high: for each new system file added in Postgres we would need
to maintain extra filtering logic in pg_rewind, something which is most
likely going to rot, leading to potential corruption problems.
My point is that I have seen, on some VMware test beds, a kernel going
rogue because of ESX servers, causing a PostgreSQL-related partition to
go into read-only mode. So if you use my previous patch, and the
partition where the target server's data folder is located turns
read-only all of a sudden, there is a risk of silencing real failures.
Even if the fetched data is ordered, paths like pg_xact and pg_twophase
are processed after all relation files, so failing to write them
correctly would silently break a set of transactions :(
So please let me suggest a set of two patches:
1) 0001 is a documentation patch which should be back-patched. This
tells two things:
1-1) If pg_rewind fails in the middle of processing, then the best thing
to do is to re-create the data folder using a fresh backup.
1-2) Be careful with files that the process running pg_rewind is not
able to write to. If those are located on both the target and the
source, then pg_rewind will copy them from the source to the target,
leading to an immediate failure. After removing those unwritable files,
be sure to clean up anything which has been copied from the source and
should be read-only, then rebuild the links.
2) 0002 is a patch for HEAD to order the set of chunks fetched, because
this makes the data writes sequential, which is good for performance,
and limits the number of open()/close() calls on the files of the
target's data folder.
I think that this is the best answer we can give now: there are a
couple of ways to improve the handling of any checks, but there are
side effects to all of them. This also feels a bit like a new feature,
while the documentation in pg_rewind is currently lacking on such
details. I have also spent some time detailing the commit message of
the first patch about all that; it would be nice to keep that in the
git history.
If the extra checks I mention at the top of this email are worth
adding, then it is possible to partially drop 1-2), rewriting it to
mention that pg_rewind tells the user about the failure at an early
stage and that the data folder can be reused again. This does not
change the fact that a pg_rewind breaking mid-flight should result in a
new base backup, so documentation is mandatory anyway. And there are a
couple of approaches that we could consider here; at least I saw two of
them. Other folks may see better approaches than I did, who knows..
Thanks,
--
Michael
Attachments:
0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch (text/x-diff; charset=us-ascii)
From 4b2718d09d41c9ea8445e467e276947bffd0cdf1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 5 Apr 2018 10:11:34 +0900
Subject: [PATCH 1/2] Add note in pg_rewind documentation about read-only files
When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
We have discussed on the problem's thread a couple of ways to deal with
the problem:
1) Consider EACCES failures on relation files as critical, but not on
non-relation files, which covers custom configuration files as well as
FSM or VM files. Being able to distinguish between custom files and
the ones critical for the system requires additional maintenance with
the addition of new filtering rules, which is costly in the long term.
2) Order the fetched block ranges and delay the truncation of a file
until its first range chunk is received. That's actually not reliable
either, as when facing a failure some of the relation files may already
have been manipulated. If by chance one is able to start and stop the
target's server after a failed rewind, they have good chances of
already having a broken instance.
There are some solutions which could be considered:
1) Add pre-checks making sure that a set of files which are going to be
processed for copy can be sanely written to, and complain about it
before writing any data on the target's data folder. This has the
disadvantage that a user would still need to rebuild the links used for
what was previously a set of read-only files.
2) Prevent the data of read-only files from being fetched from the
source, which would save some bandwidth, but requires the backend's
pg_stat_file() to be extended with an entry's st_mode. This can only
happen on HEAD.
3) Allow callers to specify custom exclusion rules.
1) is the one causing the least code churn, still it is not clear to me
whether any of those are worth considering anyway, as read-only files
would most likely need to be rebuilt after a rewind. Hence the simplest
solution is to just document the behavior and tell users not to do
that. We could always consider new solutions in future releases to ease
the handling of such files, but that may not be worth it.
Also, when pg_rewind fails mid-flight, there is likely no way to
recover the target data folder anyway, in which case a new base backup
is the best option. A note is added in the documentation as well.
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
---
doc/src/sgml/ref/pg_rewind.sgml | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 520d843f0e..ee35ce18b0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -95,6 +95,26 @@ PostgreSQL documentation
are currently on by default. <xref linkend="guc-full-page-writes"/>
must also be set to <literal>on</literal>, but is enabled by default.
</para>
+
+ <warning>
+ <para>
+ If <application>pg_rewind</application> fails while processing, then
+ the data folder of the target is likely not in a state that can be
+ recovered. In such a case, taking a fresh backup is recommended.
+ </para>
+
+ <para>
+ <application>pg_rewind</application> will fail immediately if it finds
+ files it cannot write directly to. This can happen for example when
+ the source and the target server use the same file mapping for read-only
+ SSL keys and certificates. If such files are present on the target server
+ it is recommended to remove them before running
+ <application>pg_rewind</application>. After doing the rewind, some of
+ those files may have been copied from the source, in which case it may
+ be necessary to remove the copied data and restore the set of links
+ used before the rewind.
+ </para>
+ </warning>
</refsect1>
<refsect1>
--
2.16.3
0002-Enforce-ordering-of-data-chunks-fetched-in-pg_rewind.patch (text/x-diff; charset=us-ascii)
From 812ced024a7edea7dda1634fe926936c20a22f93 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 5 Apr 2018 10:31:53 +0900
Subject: [PATCH 2/2] Enforce ordering of data chunks fetched in pg_rewind for
libpq transfer
This is good for performance, as this gives a sequential behavior to the
rewind operation, and also limits the number of files opened and closed
when applying data chunks.
Author: Christian H.
Author: Michael Paquier
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
---
src/bin/pg_rewind/libpq_fetch.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 5914b15017..880c6553ea 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -501,12 +501,18 @@ libpq_executeFileMap(filemap_t *map)
/*
* We've now copied the list of file ranges that we need to fetch to the
- * temporary table. Now, actually fetch all of those ranges.
+ * temporary table. Now, actually fetch all of those ranges. The elements
+ * fetched are ordered to ensure that the chunks applying to one file
+ * are processed successively. This gives a sequential behavior to the
+ * rewind, which is good for performance especially on large files, and
+ * limits the back-and-forth of opening and closing the same file
+ * multiple times.
*/
sql =
"SELECT path, begin,\n"
" pg_read_binary_file(path, begin, len, true) AS chunk\n"
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
receiveFileChunks(sql);
}
--
2.16.3
On Apr 5, 2018, at 3:38 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
When I first started looking at this thread, I wondered if maybe somebody
had had in mind to create an active defense against starting a postmaster
in an inconsistent target cluster, by dint of intentionally truncating
pg_control before the transfer starts and not making it valid again till
the very end. It's now clear from looking at the code that that's not
what's going on :-(. But I wonder how hard it would be to make it so,
and whether that'd be worth doing if it's not too hard.

One safeguard I can see is to add an extra loop just before processing
the file map to check if a file planned for copy can be sanely opened or
not. That's simple to do with open_target_file(), followed by
close_target_file(). If there is an EACCES error, then we complain at
this early stage, and the data folder can be reused after removing those
files. That's not actually hard to do at quick glance, but I may be
missing something so let's be careful.

Another approach that we could consider, HEAD-only though, is to extend
pg_stat_file so that an entry's st_mode is available, and use that to
not copy the file's data from the source in the first place. That would
actually be quite clean, particularly for large read-only files.

Actually, probably a safer way to attack that would be to remove or
rename the topmost PG_VERSION file, and then put it back afterwards.
That'd be far easier to recover from manually, if need be, than
clobbering pg_control.

In any case, that seems separate from the question of what to do with
read-only files in the data directory. Should we push forward with
committing Michael's previous patch, and leave that issue for later?

That one is not good either as long as we do not distinguish between
three types of files in the data folder:
1) Relation files.
2) System files which are critical to various degrees for the system.
3) Custom configuration files.

pg_rewind is able to treat 1) as a separate category, as it usually
needs to fetch a set of block ranges, but it puts 2) and 3) in the same
bucket. It would be possible to split 2) and 3), but the maintenance
cost is high: for each new system file added in Postgres we would need
to maintain extra filtering logic in pg_rewind, something which is most
likely going to rot, leading to potential corruption problems.

My point is that I have seen, on some VMware test beds, a kernel going
rogue because of ESX servers, causing a PostgreSQL-related partition to
go into read-only mode. So if you use my previous patch, and the
partition where the target server's data folder is located turns
read-only all of a sudden, there is a risk of silencing real failures.
Even if the fetched data is ordered, paths like pg_xact and pg_twophase
are processed after all relation files, so failing to write them
correctly would silently break a set of transactions :(

So please let me suggest a set of two patches:
1) 0001 is a documentation patch which should be back-patched. This
tells two things:
1-1) If pg_rewind fails in the middle of processing, then the best thing
to do is to re-create the data folder using a fresh backup.
1-2) Be careful with files that the process running pg_rewind is not
able to write to. If those are located on both the target and the
source, then pg_rewind will copy them from the source to the target,
leading to an immediate failure. After removing those unwritable files,
be sure to clean up anything which has been copied from the source and
should be read-only, then rebuild the links.
2) 0002 is a patch for HEAD to order the set of chunks fetched, because
this makes the data writes sequential, which is good for performance,
and limits the number of open()/close() calls on the files of the
target's data folder.
Thanks for crediting me in patch 0002.
One final word about my original patch, since the commit message for 0001 refers to its limitations: I acknowledge that it cannot cover situations where the order of processing files is such that the second loop detects non-writable files only after it had already done work on some others. It was meant to fix the specific problem I ran into without breaking the regression tests that ship with the source code ("make check"). The main reason that it is not more elaborate is that I did not want to submit anything affecting more than a few lines of code without prior input from the community.
So from my perspective I welcome both 0001 and 0002.
I think that this is the best answer we can give now: there are a
couple of ways to improve the handling of any checks, but there are
side effects to all of them. This also feels a bit like a new feature,
while the documentation in pg_rewind is currently lacking on such
details. I have also spent some time detailing the commit message of
the first patch about all that; it would be nice to keep that in the
git history.

If the extra checks I mention at the top of this email are worth
adding, then it is possible to partially drop 1-2), rewriting it to
mention that pg_rewind tells the user about the failure at an early
stage and that the data folder can be reused again. This does not
change the fact that a pg_rewind breaking mid-flight should result in a
new base backup, so documentation is mandatory anyway. And there are a
couple of approaches that we could consider here; at least I saw two of
them. Other folks may see better approaches than I did, who knows..
Thanks,
--
Michael
--
Christian
On Thu, Apr 05, 2018 at 07:04:40AM +0200, TipTop Labs wrote:
One final word about my original patch, since the commit message for
0001 refers to its limitations: I acknowledge that it cannot cover
situations where the order of processing files is such that the second
loop detects non-writable files only after it had already done work on
some others. It was meant to fix the specific problem I ran into
without breaking the regression tests that ship with the source code
("make check"). The main reason that it is not more elaborate is that
I did not want to submit anything affecting more than a few lines of
code without prior input from the community.
That definitely makes sense to me. Reviews and discussions can take a
long time to settle, so this method looks right to me. It is better not
to spend too much time on a patch if you are unsure of its shape, and
much better to ask for input first before diving deeper into the
details. This stuff is really complicated and the devil is in the
details.
So from my perspective I welcome both 0001 and 0002.
OK, thanks for confirming!
--
Michael
Thinking about this some more, I wondered if useful behavior with
read-only config files would be to check whether they have the same
contents on both servers, and give a warning or error if not.
Now, that would only be useful if you suppose that they actually should be
the same on both. For the particular case of server private keys, maybe
they typically would be different. So I'm not real sure about this idea,
but wanted to toss it out for discussion.
In any case, after another look at the code, it seems to me that it'd be
pretty easy to get pg_rewind to notice at the start whether a target file
it plans to overwrite is read-only: process_source_file is already doing
an lstat per file, so I think a permissions check there wouldn't add much
overhead. So at that point we could either bail out without damage, or
decide to skip the file. We could still lose if permissions change midway
through the run, but that's the sort of situation I think is OK to fail
in.
Also, I'm having second thoughts about the usefulness of adding an ORDER
BY to the fetch query just on (untested) performance grounds. It looks
to me like the chunks within a file already will be inserted into the
fetchchunks table in order, and I think within-file order is all that's
worth worrying about. In principle the remote server might read out the
rows in some other order than we stored them, but since fetchchunks is a
temp table and not subject to synchronize_seqscans, I don't think we
really need to worry about that.
Furthermore, the patch as given has got a serious performance gotcha:
sql =
"SELECT path, begin,\n"
" pg_read_binary_file(path, begin, len, true) AS chunk\n"
- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
While 9.6 and later will do this in a sane way, older versions will
execute pg_read_binary_file() ahead of the sort step, like so:
# explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM fetchchunks ORDER BY path, begin;
QUERY PLAN
-------------------------------------------------------------------------------------
Sort (cost=74639.34..75889.41 rows=500025 width=44)
Output: path, begin, (pg_read_binary_file(path, begin, (len)::bigint, true))
Sort Key: fetchchunks.path, fetchchunks.begin
-> Seq Scan on pg_temp_2.fetchchunks (cost=0.00..11925.38 rows=500025 width=44)
Output: path, begin, pg_read_binary_file(path, begin, (len)::bigint, true)
(5 rows)
with the effect that we're passing the entire contents of the source data
directory (or at least, as much of it as pg_rewind needs to read) through
the sort. Yipes. So it'd be safer to do it like this:
# explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM (select * from fetchchunks ORDER BY path, begin) ss;
QUERY PLAN
---------------------------------------------------------------------------------------------
Subquery Scan on ss (cost=72139.22..80889.66 rows=500025 width=44)
Output: ss.path, ss.begin, pg_read_binary_file(ss.path, ss.begin, (ss.len)::bigint, true)
-> Sort (cost=72139.22..73389.28 rows=500025 width=44)
Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len
Sort Key: fetchchunks.path, fetchchunks.begin
-> Seq Scan on pg_temp_2.fetchchunks (cost=0.00..9425.25 rows=500025 width=44)
Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len
(7 rows)
But the real bottom line here is I don't feel a need to touch the fetch
query at all without some actual performance measurements proving there's
a benefit.
regards, tom lane
On Sat, Apr 07, 2018 at 11:51:32AM -0400, Tom Lane wrote:
In any case, after another look at the code, it seems to me that it'd be
pretty easy to get pg_rewind to notice at the start whether a target file
it plans to overwrite is read-only: process_source_file is already doing
an lstat per file, so I think a permissions check there wouldn't add much
overhead. So at that point we could either bail out without damage, or
decide to skip the file.
Yes, that's one of the methods I mentioned in my previous email. Now,
do we want to fail the run immediately if such a file is found? Or do
we want to report all such files at once so that the user does not need
to run pg_rewind multiple times? The latter would be much more useful.
A downside here is that the file from the source would still be fetched
and copied to the target. So a file which was read-only would basically
become writable. I don't think that is a big deal, as the user has
superuser rights at this point, but it's worth mentioning.
We could still lose if permissions change midway
through the run, but that's the sort of situation I think is OK to fail
in.
If the file system switches to read-only in the middle of the run, I see
no way to recover from that. That would be bad luck, but it is way
better to fail hard and let the user that something has gone wrong. I
still think that we want to document that if pg_rewind fails hard
midflight, then the only thing safe enough moving forward is to use a
new base backup.
--
Michael
On Sun, Apr 08, 2018 at 07:22:58AM +0900, Michael Paquier wrote:
Yes, that's one of the methods I mentioned in my previous email. Now,
do we want to fail the run immediately if such a file is found? Or do
we want to report all such files at once so that the user does not need
to run pg_rewind multiple times?
So I have been playing with that part...
The latter would be much more useful.
While that would be nice, the result is rather ugly and needs to
generate the same error message in two code paths. So I have arranged
things in such a way that the first failure found makes the whole
rewind fail, while still allowing retries to work. The attached has a
test as well as documentation.
Thoughts?
A downside here is that the file from the source would still be fetched
and copied to the target. So a file which was read-only would basically
become writable. I don't think that is a big deal, as the user has
superuser rights at this point, but it's worth mentioning.
That's mentioned in the docs of the attached.
--
Michael
Attachments:
rewind-readonly-v4.patch (text/x-diff; charset=us-ascii)
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 520d843f0e..0952f3903e 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -95,6 +95,25 @@ PostgreSQL documentation
are currently on by default. <xref linkend="guc-full-page-writes"/>
must also be set to <literal>on</literal>, but is enabled by default.
</para>
+
+ <para>
+ Before rewriting any files in the target data directory,
+ <application>pg_rewind</application> checks that all the files it
+ needs to overwrite are writable, and fails immediately if any are
+ not. A rewind can be executed again after fixing the permission
+ issues. This can happen for example with read-only SSL-related
+ configuration files. Note that if a file is copied from the source
+ to the target, it may be necessary to rebuild the links properly, as
+ the rewind would have re-created those files with new permission sets.
+ </para>
+
+ <warning>
+ <para>
+ If <application>pg_rewind</application> fails while processing, then
+ the data folder of the target is likely not in a state that can be
+ recovered. In such a case, taking a new fresh backup is recommended.
+ </para>
+ </warning>
</refsect1>
<refsect1>
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index c3fc519895..c0b58c76ce 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -189,7 +189,56 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
exists = false;
}
else
- exists = true;
+ {
+ int fd;
+
+ /*
+ * The file exists on the source and the target. Now check if
+ * the file is writable.
+ */
+ fd = open(localpath, O_WRONLY);
+
+ if (fd < 0)
+ {
+ if (errno == EISDIR)
+ {
+ /*
+ * No need to worry about directories, those are not
+ * valid targets for writes.
+ */
+ exists = true;
+ }
+ else if (errno == EACCES)
+ {
+ /*
+ * The file exists on the source and the target but it is
+ * not writable on target. In order to avoid failures
+ * mid-flight when processing the data directory, fail
+ * immediately and let the user know. This way, if anything
+ * like a read-only file is found, then pg_rewind complains
+ * and a successive run can be completed after doing some
+ * cleanup on this target data directory.
+ */
+ pg_fatal("file \"%s\" exists on both target and source server, but is not writable on target.\n"
+ "Please check the permissions on \"%s\" before executing again pg_rewind.\n",
+ path, localpath);
+ }
+ else
+ {
+ /*
+ * Complain about anything else, that should not happen,
+ * even ENOENT which has already been checked above.
+ */
+ pg_fatal("could not open file \"%s\": %s\n",
+ localpath, strerror(errno));
+ }
+ }
+ else
+ {
+ exists = true;
+ close(fd);
+ }
+ }
switch (type)
{
diff --git a/src/bin/pg_rewind/t/006_readonly.pl b/src/bin/pg_rewind/t/006_readonly.pl
new file mode 100644
index 0000000000..489e1f4593
--- /dev/null
+++ b/src/bin/pg_rewind/t/006_readonly.pl
@@ -0,0 +1,77 @@
+# Test how pg_rewind reacts to read-only files in the data dirs.
+# pg_rewind should fail until such files are removed from both nodes.
+
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 2;
+
+use File::Copy;
+use File::Find;
+
+use RewindTest;
+
+my $test_mode = "remote";
+
+RewindTest::setup_cluster($test_mode);
+RewindTest::start_master();
+RewindTest::create_standby($test_mode);
+
+# Create the same read-only file in standby and master
+my $test_master_datadir = $node_master->data_dir;
+my $test_standby_datadir = $node_standby->data_dir;
+my $readonly_master = "$test_master_datadir/readonly_file";
+my $readonly_standby = "$test_standby_datadir/readonly_file";
+
+append_to_file($readonly_master, "in master");
+append_to_file($readonly_standby, "in standby");
+chmod 0400, $readonly_master, $readonly_standby;
+
+RewindTest::promote_standby();
+
+# Stop the master and run pg_rewind.
+$node_master->stop;
+
+my $master_pgdata = $node_master->data_dir;
+my $standby_pgdata = $node_standby->data_dir;
+my $standby_connstr = $node_standby->connstr('postgres');
+my $tmp_folder = TestLib::tempdir;
+
+# Keep a temporary postgresql.conf for master node or it would be
+# overwritten during the rewind.
+copy(
+ "$master_pgdata/postgresql.conf",
+ "$tmp_folder/master-postgresql.conf.tmp");
+
+# Including read-only data in the source and the target will
+# cause pg_rewind to fail.
+my $rewind_command = [ 'pg_rewind', "--debug",
+ "--source-server", $standby_connstr,
+ "--target-pgdata=$master_pgdata" ];
+
+command_fails($rewind_command, 'pg_rewind fails with read-only');
+
+# Now remove the read-only data on both sides; the data folder
+# from the previous attempt should still be usable.
+unlink($readonly_master);
+unlink($readonly_standby);
+command_ok($rewind_command, 'pg_rewind passes without read-only');
+
+# Now move back postgresql.conf with old settings
+move("$tmp_folder/master-postgresql.conf.tmp",
+ "$master_pgdata/postgresql.conf");
+
+# Plug-in rewound node to the now-promoted standby node
+my $port_standby = $node_standby->port;
+$node_master->append_conf('recovery.conf', qq(
+primary_conninfo='port=$port_standby'
+standby_mode=on
+recovery_target_timeline='latest'
+));
+
+# Restart the master to check that rewind went correctly.
+$node_master->start;
+
+RewindTest::clean_rewind_test();
+
+exit(0);
On Mon, Apr 09, 2018 at 01:58:26PM +0900, Michael Paquier wrote:
That's mentioned in the docs of the attached.
So, I have read this thread again, and it seems to me that there are
many approaches, but all of them have downsides. Hence, at the end, I
would suggest documenting the limitation, for example with the attached,
and calling it a day.
Thoughts?
--
Michael
Attachments:
0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch (text/x-diff; charset=us-ascii)
From e7a1f9e599ec56a3b142b943b2351b10b5037585 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 19 Jun 2018 16:03:51 +0900
Subject: [PATCH] Add note in pg_rewind documentation about read-only files
When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
We have discussed on the thread a couple of ways to deal with the
problem:
1) Consider EACCES failures on relation files as critical, but not on
non-relation files, which covers custom configuration files as well as
FSM or VM files. Being able to distinguish between custom files and
the ones critical for the system requires additional maintenance with
the addition of new filtering rules, which is costly in the long term.
2) Order the fetched block ranges and delay the truncation of a file
until its first range chunk is received. That's actually not reliable
either, as when facing a failure some of the relation files may already
have been manipulated. If by chance one is able to start and stop the
target's server after a failed rewind, they have good chances of
already having a broken instance.
There are some solutions which could be considered:
1) Add pre-checks making sure that a set of files which are going to be
processed for copy can be sanely written to, and complain about it
before writing any data on the target's data folder. This has the
disadvantage that a user would still need to rebuild the links used for
what was previously a set of read-only files, and also to fetch
read-only files on the source and save them as raw files locally, which
could be security-sensitive.
2) Prevent the data of read-only files from being fetched from the
source, which would save some bandwidth, but requires the backend's
pg_stat_file() to be extended with an entry's st_mode. This can only
happen on HEAD.
3) Allow callers to specify custom exclusion rules; however, this could
be a foot-gun if it accidentally filtered out critical files.
1) is the one causing the least code churn, still it is not clear to me
whether any of those are worth considering anyway, as read-only files
would most likely need to be rebuilt after a rewind. Hence the simplest
solution is to just document the behavior and tell users not to do
that. We could always consider new solutions in future releases to ease
the handling of such files, but that may not be worth it. 2) would be
rather interesting, but that's a lot of infrastructure to justify.
Also, when pg_rewind fails mid-flight, there is likely no way to
recover the target data folder anyway, in which case a new base backup
is the best option. A note is added in the documentation as well.
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
---
doc/src/sgml/ref/pg_rewind.sgml | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 520d843f0e..ee35ce18b0 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -95,6 +95,26 @@ PostgreSQL documentation
are currently on by default. <xref linkend="guc-full-page-writes"/>
must also be set to <literal>on</literal>, but is enabled by default.
</para>
+
+ <warning>
+ <para>
+ If <application>pg_rewind</application> fails while processing, then
+ the data folder of the target is likely not in a state that can be
+ recovered. In such a case, taking a fresh backup is recommended.
+ </para>
+
+ <para>
+ <application>pg_rewind</application> will fail immediately if it finds
+ files it cannot write directly to. This can happen for example when
+ the source and the target server use the same file mapping for read-only
+ SSL keys and certificates. If such files are present on the target server
+ it is recommended to remove them before running
+ <application>pg_rewind</application>. After doing the rewind, some of
+ those files may have been copied from the source, in which case it may
+ be necessary to remove the copied data and restore the set of links
+ used before the rewind.
+ </para>
+ </warning>
</refsect1>
<refsect1>
--
2.17.1
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
I agree that this problem should just be documented, and no further fix done. The proposed commit message seems too long - I think it should mainly just refer to the mailing list discussion. Other than that it seems fine.
The new status of this patch is: Ready for Committer
On Fri, Jul 06, 2018 at 03:45:46PM +0000, Andrew Dunstan wrote:
I agree that this problem should just be documented, and no further
fix done. The proposed commit message seems to be too long - I think
it should mainly just refer to the mailing list discussion. Other
than that it seems fine.

The new status of this patch is: Ready for Committer
Thanks for the review, Andrew. Committed as suggested.
--
Michael