BUG #14999: pg_rewind corrupts control file global/pg_control

Started by PG Bug reporting formover 8 years ago29 messageshackersbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org
hackersbugs

The following bug has been logged on the website:

Bug reference: 14999
Logged by: Christian H.
Email address: office@tiptop-labs.com
PostgreSQL version: 10.1
Operating system: e.g. Debian Buster
Description:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .

#2Michael Paquier
michael@paquier.xyz
In reply to: PG Bug reporting form (#1)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .

You may not know when github.com is gone, which would cause the data
you are attaching here to go away. What looks interesting is
pg_rewind.patch, which is what you are proposing as a fix, right? I am
attaching it here for PG archives.

-               open_target_file(filename, false);
+               /* Trunc target file for action FILE_ACTION_COPY. */
+               open_target_file(filename, chunkoff == 0);

write_target_range(chunk, chunkoff, chunksize);
Hm. Let me think more about that as there are quite a few
distributions that link to SSL files that cannot be written, and
pg_rewind copies all configuration files in full.

- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
If this is aimed at improving the performance of pg_rewind by making
chunk writes more sequential, it should be a different patch. I would
see value in that. If this is aimed to change the order of the files
to avoid the write corruption, this is wrong.

I would imagine that you could see something similar with the offline
mode, haven't tested yet though.
--
Michael

Attachments:

pg_rewind.patchapplication/octet-stream; name=pg_rewind.patchDownload+4-4
#3TipTop Labs
office@tiptop-labs.com
In reply to: Michael Paquier (#2)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .

You may not know when github.com is gone, which would cause the data
you are attaching here to go away. What looks interesting is
pg_rewind.patch, which is what you are proposing as a fix, right? I am
attaching it here for PG archives.

Yes, it's pg_rewind.patch. There is a bit of rationale in README, feel free to also attach here if/as needed.

-               open_target_file(filename, false);
+               /* Trunc target file for action FILE_ACTION_COPY. */
+               open_target_file(filename, chunkoff == 0);

write_target_range(chunk, chunkoff, chunksize);
Hm. Let me think more about that as there are quite a few
distributions that link to SSL files that cannot be written, and
pg_rewind copies all configuration files in full.

Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.

- "FROM fetchchunks\n";
+ "FROM fetchchunks\n"
+ "ORDER BY path, begin\n";
If this is aimed at improving the performance of pg_rewind by making
chunk writes more sequential, it should be a different patch. I would
see value in that. If this is aimed to change the order of the files
to avoid the write corruption, this is wrong.

It may be wrong then; from README: Truncation of the file was moved to the second loop. Truncation occurs there if chunks are written into files at offset 0. This is the case for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these chunks are processed first.

I would imagine that you could see something similar with the offline
mode, haven't tested yet though.
--
Michael
<pg_rewind.patch>

--
Christian

#4Michael Paquier
michael@paquier.xyz
In reply to: TipTop Labs (#3)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Fri, Jan 5, 2018 at 12:36 PM, TipTop Labs <office@tiptop-labs.com> wrote:

Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.

I recall that Ubuntu and Debian do that. I don't use them but others
on this list will likely correct me. The point is that this is
authorized.
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: TipTop Labs (#3)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:

On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. When using a relation file, there should be failures. There
are still two things I am not completely happy about:
- pg_control is considered as a configuration file with this patch,
however we should fail it pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider add a special handling for
things like pg_control, filenode.map. etc. so as you get a hard failure
if those are not readable, so they should enter in the category of
FILE_ACTION_COPY_DATA. This needs a bit more thoughts in
process_source_file().
- open_target_file() resets manually errno. This is necessary as this
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
up with this solution.

I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.

Thoughts?
--
Michael

Attachments:

rewind-readonly-fix-v1.patchtext/x-diff; charset=us-asciiDownload+126-24
#6TipTop Labs
office@tiptop-labs.com
In reply to: Michael Paquier (#5)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

1. I can confirm that your patch is effective also in my Docker-based test setup and with the current REL_10_STABLE code base (i.e. PostgreSQL 10.2).

2. Your patch is more encompassing than the one I had submitted earlier, and it may be the right way to go. It is cleaner but more "complicated" in that it may require enlisting/recognizing all those special files (pg_control, filenode.map, etc). IMO the earlier patch would already/tolerate handle those, because the distinction it makes is not based on whether something is a configuration file, but purely on whether it is writable.

3. Sorry for the late response :)

Show quoted text

On Jan 15, 2018, at 8:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:

On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. When using a relation file, there should be failures. There
are still two things I am not completely happy about:
- pg_control is considered as a configuration file with this patch,
however we should fail it pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider add a special handling for
things like pg_control, filenode.map. etc. so as you get a hard failure
if those are not readable, so they should enter in the category of
FILE_ACTION_COPY_DATA. This needs a bit more thoughts in
process_source_file().
- open_target_file() resets manually errno. This is necessary as this
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
up with this solution.

I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.

Thoughts?
--
Michael
<rewind-readonly-fix-v1.patch>

#7Michael Paquier
michael@paquier.xyz
In reply to: TipTop Labs (#6)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Mon, Feb 26, 2018 at 05:28:53PM +0100, TipTop Labs wrote:

1. I can confirm that your patch is effective also in my Docker-based
test setup and with the current REL_10_STABLE code base
(i.e. PostgreSQL 10.2).

Thanks for checking. Note that I am still not completely happy with the
handling in errno in some newly-added code paths..

2. Your patch is more encompassing than the one I had submitted
earlier, and it may be the right way to go. It is cleaner but more
"complicated" in that it may require enlisting/recognizing all those
special files (pg_control, filenode.map, etc). IMO the earlier patch
would already/tolerate handle those, because the distinction it makes
is not based on whether something is a configuration file, but purely
on whether it is writable.

You are basically looking for that I think:
/messages/by-id/20180205071022.GA17337@paquier.xyz
You cannot ignore pg_control and filenode.map though as those are
critical data so they have to be updated. So if those files are not
writable, you actually have more problems than you think as the cluster
would not be able to correctly start.
--
Michael

#8Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Michael Paquier (#7)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On 27 February 2018 at 02:24, Michael Paquier <michael@paquier.xyz> wrote:

Note that I am still not completely happy with the
handling in errno in some newly-added code paths..

It maybe a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCESS` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:

For some system calls and library functions (e.g., getpriority(2)), -1 is
a valid return on success. In such cases, a successful return can be
distinguished from an error return by setting errno to zero before the
call, and then, if the call returns a status that indicates that an error
may have occurred, checking to see if errno has a nonzero value.

#9Michael Paquier
michael@paquier.xyz
In reply to: Dmitry Dolgov (#8)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:

It maybe a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCESS` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:

[ ... reads and feels stupid ...]

Of course all the checks should be where dstno is negative...

I have done a second pass on the patch, and attached is a new version.
This fixes this handling of errno and addresses some typos. I have also
fixed the test case where one of the read-only files was not properly
switched to do so. I have also added a commit log message.
--
Michael

Attachments:

0001-Improve-error-handling-of-configuration-files-in-pg_.patchtext/x-diff; charset=us-asciiDownload+133-25
#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#9)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Mon, Mar 5, 2018 at 4:54 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:

It maybe a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCESS` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:

[ ... reads and feels stupid ...]

Of course all the checks should be where dstno is negative...

I have done a second pass on the patch, and attached is a new version.
This fixes this handling of errno and addresses some typos. I have also
fixed the test case where one of the read-only files was not properly
switched to do so. I have also added a commit log message.

@@ -24,7 +24,10 @@
 typedef enum
 {
        FILE_ACTION_CREATE,                     /* create local
directory or symbolic link */
-       FILE_ACTION_COPY,                       /* copy whole file,
overwriting if exists */
+       FILE_ACTION_COPY_CONFIG,        /* copy whole configuration
file, overwriting
+                                                                * if exists */
+       FILE_ACTION_COPY_DATA,          /* copy whole relation file,
overwriting if
+                                                                * exists */
        FILE_ACTION_COPY_TAIL,          /* copy tail from 'oldsize' to
'newsize' */
        FILE_ACTION_NONE,                       /* no action (we might
still copy modified
                                                                 *
blocks based on the parsed WAL) */

Since files other than relation files such as vm, fsm, WAL are
categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
misreading. For example, we can use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER?

As you mentioned before, with this patch we end up ignoring file-open
error of database files other than relation files. I guess it's a bit
risky. We can enter them to COPY_DATA but I'd defer it to committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#11Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#10)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:

Since files other than relation files such as vm, fsm, WAL are
categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
misreading. For example, we can use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER?

I am not stopped on a given name.

As you mentioned before, with this patch we end up ignoring file-open
error of database files other than relation files. I guess it's a bit
risky. We can enter them to COPY_DATA but I'd defer it to committer.

COPY_DATA is aimed at being used on files which can be parsed by
blocks. You cannot do that with VMs and FSMs. Now you are true that we
could complain for non-configuration files which need to be
fully-copied, still Postgres issues a fsync on all the files of the data
folder when beginning recovery, so you would bump on problems anyway
after restarting the rewound instance.
--
Michael

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#11)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:

Since files other than relation files such as vm, fsm, WAL are
categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
misreading. For example, we can use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER?

I am not stopped on a given name.

Hmm, when I used pg_rewind --debug with this patch the debug message
made me confused because almost database files appears with
COPY_CONFIG. Also looking at the source code comment, it says
COPY_CONFIG is aimed to configuration files. But these file are not
configuration files actually. COPY_DATA makes sense to me, but
COPY_CONFIG doesn't.

Anyway, other than that the patch looks good to me. I'd like to wait
for the Dmitory's review comment before marking this as "Ready for
Commiter".

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Masahiko Sawada (#12)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On 6 March 2018 at 02:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:

Since files other than relation files such as vm, fsm, WAL are
categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
misreading. For example, we can use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER?

I am not stopped on a given name.

Hmm, when I used pg_rewind --debug with this patch the debug message
made me confused because almost database files appears with
COPY_CONFIG. Also looking at the source code comment, it says
COPY_CONFIG is aimed to configuration files. But these file are not
configuration files actually. COPY_DATA makes sense to me, but
COPY_CONFIG doesn't.

Anyway, other than that the patch looks good to me. I'd like to wait
for the Dmitory's review comment before marking this as "Ready for
Commiter".

Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCESS`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like there are too many
symbolic links were encountered in resolving a pathname (`ELOOP`)?

#14Michael Paquier
michael@paquier.xyz
In reply to: Dmitry Dolgov (#13)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:

Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCESS`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like there are too many
symbolic links were encountered in resolving a pathname (`ELOOP`)?

The presence of the file is ensured in the pre-phase which builds the
file map (see process_source_file), and actions are taken depending on
the presence of a file on the source and the target. So a file missing
on the target after those pre-checks have ensured that it was actually
existing should be reported with ENOENT. ELOOP would as well be faced
on the backend before seeing it in pg_rewind, no? In short, it seems to
me that it is better to keep the code simple.
--
Michael

#15Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Michael Paquier (#14)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On 7 March 2018 at 02:46, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:

Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCESS`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like there are too many
symbolic links were encountered in resolving a pathname (`ELOOP`)?

The presence of the file is ensured in the pre-phase which builds the
file map (see process_source_file), and actions are taken depending on
the presence of a file on the source and the target. So a file missing
on the target after those pre-checks have ensured that it was actually
existing should be reported with ENOENT. ELOOP would as well be faced
on the backend before seeing it in pg_rewind, no? In short, it seems to
me that it is better to keep the code simple.

Ok, I agree. Then yes, this patch can be probably marked as ready.

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dmitry Dolgov (#15)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

Dmitry Dolgov <9erthalion6@gmail.com> writes:

Ok, I agree. Then yes, this patch can be probably marked as ready.

I started to look over this patch, and soon decided that the
commentary in pg_rewind is impossibly bad :-(. You need to study
libpq_executeFileMap for a long time before you are even sure which side
of the transfer it's executing on; the fact that it's doing a copy *TO*
the remote server is incredibly misleading, and the comments are not
anywhere near good enough to de-confuse somebody coming upon the code for
the first time. Not to mention that the function's name conveys nothing
whatever. Maybe it's not the job of this patch to fix that, but I can't
help thinking that poor design and documentation are a big part of why
this bug exists in the first place.

But enough of that rant. Now that I've more or less wrapped my head
around what's happening, I think this is the core of the problem:
libpq_executeFileMap truncates every non-data file in the local data
directory before it's fetched anything at all from the remote server.
This seems to me to be utterly cavalier and brittle, because ANY FAILURE
AT ALL in the fetch sequence leaves you with a data directory that's been
trashed in whole or in part --- and that sequence could span a good long
time, if there's a lot to copy. The originally complained-of problem
(that pg_control gets nuked) is just the tip of that iceberg, and I'm
afraid that the proposed patch is only removing a single potential
failure mode.

I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
/messages/by-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.

Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.

regards, tom lane

#17Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#16)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:

I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
/messages/by-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.

Please note that Christopher's patch is awkwardly making the difference
between a configuration file and a relation file by abusing of chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.

Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.

I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files a first chunk is received only if it is a
configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.

In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.

At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!

So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.

Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.

The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.

Thoughts?
--
Michael

#18TipTop Labs
office@tiptop-labs.com
In reply to: Michael Paquier (#17)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Apr 3, 2018, at 4:41 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:

I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
/messages/by-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw@mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.

Please note that Christopher's patch is awkwardly making the difference
between a configuration file and a relation file by abusing of chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.

IMO this is a better characterization of my original patch (from https://github.com/tiptop-labs/postgres-patches/blob/master/bug_14999/README <https://github.com/tiptop-labs/postgres-patches/blob/master/bug_14999/README&gt;):

bug
===

libpq_fetch.c loops twice over files in pgdata, a first time in
libpq_executeFileMap(), and then a second time (for files with action
FILE_ACTION_COPY or FILE_ACTION_COPY_TAIL) in receiveFileChunks().
For FILE_ACTION_COPY, it deletes (truncates) the file from the target directory
already during the first loop. If pg_rewind then encounters a read-only file
(e.g. server.key) still during the first loop, it exits with pg_fatal
("could not open target file"). After such truncation of global/pg_control
pg_rewind cannot run again ("unexpected control file size 0, , expected 8192")
and a restore from pg_basebackup is needed.

patch
=====

Truncation of the file was moved to the second loop. Truncation occurs there if
chunks are written into files at offset 0. This is the case for
FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these
chunks are processed first.

Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.

I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files a first chunk is received only if it is a
configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.

In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.

At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!

So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.

Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.

The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.

Thoughts?
--
Michael

Regards,
Christian

#19Michael Paquier
michael@paquier.xyz
In reply to: TipTop Labs (#18)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote:

Truncation of the file was moved to the second loop. Truncation occurs
there if chunks are written into files at offset 0. This is the case
for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures
that these chunks are processed first.

Yes. I have spent a large portion of my day looking at that. And
actually this is wrong as well. First, please find attached a patch I
have written while testing your changes. There is nothing much new into
it, except that I added more comments and a test (other tests fail as
well, that does not matter).

First, I have looked at a way to not rely on the tweak with chunkoff by
extending the temporary table used to register the chunks with an extra
column which tracks the type of action which is taken on the file.
Another one I looked at was to pass down the action type taken on the
entry directly to receiveFileChunks(). However both of them have been
grotty.

So after that I falled back to your patch and began testing it, which is
where I noticed that we can *never* give the insurance to recover a data
folder on which an error has happened in the middle of a pg_rewind. The
reason for that is quite simple: even if the truncation has been moved
down to the moment where the first chunk of a file is received, you may
have already done work on some relation files. Particularly, some of
them may have been truncated down to a given size without a new range of
blocks fetched from the source. So the data folder would be in an
inconsistent state if trying to rewind it again.

If the control file from the source has been received and then a
read-only error is found, then in order to perform a subsequent rewind
you need to start and stop the target cluster cleanly, or update
manually the control file and mark it as cleanly stopped. And this is a
recipe for data corruption as we now start a rewind based on the
guarantee that a target cluster has been able to cleanly stop, with a
shutdown checkpoint issued. You could always try to start and stop the
cluster cleanly, but if your target instance replays WAL on data blocks
which have been truncated by a previous rewind, you would need to
re-create an instance using a new base backup, which is actually harder
to diagnose.

The patch I sent prevously which makes the difference between relation
files and "configuration" files helps a bit, still it makes me uneasy
because it will not be able to handle correctly failures on files which
are part of a relation but need to be fully copied. So I think that we
should drop this approach as well for its complexity.

Another thing that could definitely help is to work on a patch which
checks that *all* files are writable before doing any operations for
files which are present on both the source and the target, and make the
rewind nicely fail with the source still intact. That's quite a bit of
refactoring ahead for little benefit I think in the long run.

What we could simply do is to document the limitation. Hence, if both
the target and the source have read-only files, then the user needs to
be sure to remove them from the target before doing the rewind, and do
the re-linking after the operation. If an error is found while
processing the rewind, then a new base backup needs to be taken.

The refactoring I am mentioning two paragraphs above could definitely be
done to ease user's life, but I would treat that as a new feature and
not a bug. Another approach, way more simple that we could do is to
scan the data folder of the target before doing anything and see if some
files are not writable, and then report this list back to the user. If
we do that even for files which are on the target but not the source
then that would be a loss for some cases, however I would like to
imagine that when using pg_rewind the configuration file set is
consistent across nodes in a majority of cases, so a user could remove
those non-writable files manually. This has also the advantage to be
non-intrusive, so a back-patch is definitely possible.

Thoughts?
--
Michael

Attachments:

rewind-readonly-v3.patchtext/x-diff; charset=us-asciiDownload+104-5
#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#19)
hackersbugs
Re: BUG #14999: pg_rewind corrupts control file global/pg_control

Michael Paquier <michael@paquier.xyz> writes:

So after that I falled back to your patch and began testing it, which is
where I noticed that we can *never* give the insurance to recover a data
folder on which an error has happened in the middle of a pg_rewind. The
reason for that is quite simple: even if the truncation has been moved
down to the moment where the first chunk of a file is received, you may
have already done work on some relation files. Particularly, some of
them may have been truncated down to a given size without a new range of
blocks fetched from the source. So the data folder would be in an
inconsistent state if trying to rewind it again.

Yes, we certainly cannot guarantee that failure partway through pg_rewind
leaves a consistent state of the target data directory. It is likely
worth pointing that out in the documentation. Whether we can or should
do anything about it is a different question.

When I first started looking at this thread, I wondered if maybe somebody
had had in mind to create an active defense against starting a postmaster
in an inconsistent target cluster, by dint of intentionally truncating
pg_control before the transfer starts and not making it valid again till
the very end. It's now clear from looking at the code that that's not
what's going on :-(. But I wonder how hard it would be to make it so,
and whether that'd be worth doing if it's not too hard.

Actually, probably a safer way to attack that would be to remove or
rename the topmost PG_VERSION file, and then put it back afterwards.
That'd be far easier to recover from manually, if need be, than
clobbering pg_control.

In any case, that seems separate from the question of what to do with
read-only files in the data directory. Should we push forward with
committing Michael's previous patch, and leave that issue for later?

regards, tom lane

#21Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#20)
hackersbugs
#22TipTop Labs
office@tiptop-labs.com
In reply to: Michael Paquier (#21)
hackersbugs
#23Michael Paquier
michael@paquier.xyz
In reply to: TipTop Labs (#22)
hackersbugs
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#23)
hackersbugs
#25Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#24)
hackersbugs
#26Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#25)
hackersbugs
#27Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#26)
hackersbugs
#28Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#27)
hackersbugs
#29Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#28)
hackersbugs