BUG #15331: Please check if recovery.conf can be renamed

Started by PG Bug reporting form | over 7 years ago | 4 messages | bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15331
Logged by: Phil Endecott
Email address: spam_from_pgsql_lists@chezphil.org
PostgreSQL version: 9.6.10
Operating system: Debian Stretch
Description:

When a standby server is promoted it renames recovery.conf to
recovery.done.
That will not be possible if that file is owned by root or otherwise has the
wrong permissions. It's unusual for a program to modify its own
configuration files like this.
It would be great if PostgreSQL could check that the permissions are
suitable when it starts, and emit a warning if not. Currently it only fails
when asked to promote, with this log message:
FATAL: could not open file "recovery.conf": Permission denied
(Note that it only says "could not open", not "could not rename".)
This means that promotion fails, and for me even after fixing the
permissions the system was in an odd state that took some work to fix.
Failover is hard to get right; emitting a warning earlier in this case would
mean one less thing to go wrong.
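
Until the server performs such a check itself, an external pre-flight check
along these lines might catch the problem before a promotion is attempted.
This is only a sketch: the function name rename_preflight is hypothetical, it
is not part of PostgreSQL, and it assumes it runs as the same OS user as the
server. It relies on the fact that renaming a file needs write and search
permission on the containing directory, and it checks readability of the file
because the reported error was "could not open".

```python
import os


def rename_preflight(data_dir, name="recovery.conf"):
    """Return a list of warnings; an empty list means nothing was detected.

    Hypothetical helper, not a PostgreSQL API. Run it as the OS user that
    runs the server, because os.access() tests the calling user's rights.
    """
    path = os.path.join(data_dir, name)
    if not os.path.exists(path):
        return [f"{path} does not exist"]

    problems = []
    # The reported failure was "could not open file", so check readability.
    if not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable by the current user")
    # rename() changes the directory entry, so the directory itself must be
    # writable and searchable by the current user.
    if not os.access(data_dir, os.W_OK | os.X_OK):
        problems.append(f"{data_dir} is not writable and searchable")
    return problems
```

A monitoring job could call rename_preflight() on the standby's data directory
and alert on a non-empty result. The check is advisory: it does not cover
sticky-bit directories, and permissions can change between the check and the
promotion.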

#2 Michael Paquier
michael@paquier.xyz
In reply to: PG Bug reporting form (#1)
Re: BUG #15331: Please check if recovery.conf can be renamed

On Thu, Aug 16, 2018 at 11:30:09AM +0000, PG Bug reporting form wrote:

> This means that promotion fails, and for me even after fixing the
> permissions the system was in an odd state that took some work to fix.
> Failover is hard to get right; emitting a warning earlier in this case would
> mean one less thing to go wrong.

I think that you would be interested in this recent commit (fixed as of
the last round of minor releases):
commit: cbc55da556bbcb649e059804009c38100ee98884
committer: Michael Paquier <michael@paquier.xyz>
date: Mon, 9 Jul 2018 10:22:34 +0900
Rework order of end-of-recovery actions to delay timeline history write

And this thread:
/messages/by-id/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com

To give you a summary, once recovery finished and before recovery.conf
was renamed, some on-disk actions happened, which could put the cluster
in a weird state, perhaps similarly to what you saw.
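
The general idea behind that reordering can be illustrated with a toy sketch:
run the step that may fail (the rename) before the step that leaves state
behind on disk. This is not PostgreSQL's code; the function name
finish_recovery, the history file name, and its contents are placeholders
chosen for illustration.

```python
import os


def finish_recovery(data_dir):
    """Toy model of end-of-recovery ordering (not PostgreSQL's implementation).

    The rename runs first, so if it raises OSError nothing else has been
    written and the directory is left as it was.
    """
    conf = os.path.join(data_dir, "recovery.conf")
    done = os.path.join(data_dir, "recovery.done")
    # Placeholder file name standing in for a timeline history file.
    history = os.path.join(data_dir, "00000002.history")

    os.rename(conf, done)  # may fail; no other on-disk change has happened yet

    # Only after the fallible step succeeded is the new state written.
    with open(history, "w") as f:
        f.write("placeholder timeline history entry\n")
    return history
```

If the rename fails in this model, no history file exists afterwards, so a
later retry starts from the same state as the first attempt.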
--
Michael

#3 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#2)
Re: BUG #15331: Please check if recovery.conf can be renamed

On 2018-08-16 20:50:55 +0900, Michael Paquier wrote:

> On Thu, Aug 16, 2018 at 11:30:09AM +0000, PG Bug reporting form wrote:
>
>> This means that promotion fails, and for me even after fixing the
>> permissions the system was in an odd state that took some work to fix.
>> Failover is hard to get right; emitting a warning earlier in this case would
>> mean one less thing to go wrong.
>
> I think that you would be interested in this recent commit (fixed as of
> the last round of minor releases):
> commit: cbc55da556bbcb649e059804009c38100ee98884
> committer: Michael Paquier <michael@paquier.xyz>
> date: Mon, 9 Jul 2018 10:22:34 +0900
> Rework order of end-of-recovery actions to delay timeline history write
>
> And this thread:
> /messages/by-id/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
>
> To give you a summary, once recovery finished and before recovery.conf
> was renamed, some on-disk actions happened, which could put the cluster
> in a weird state, perhaps similarly to what you saw.

How would this address OP's concern? You'd still not learn meaningfully
earlier that your attempted promotion failed (instead of learning of the
problem before you ever promote).

Greetings,

Andres Freund

#4 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#3)
Re: BUG #15331: Please check if recovery.conf can be renamed

On Thu, Aug 16, 2018 at 05:09:43AM -0700, Andres Freund wrote:

> How would this address OP's concern? You'd still not learn meaningfully
> earlier that your attempted promotion failed (instead of learning of the
> problem before you ever promote).

What the commit mentioned previously does is make sure that, even if
renaming recovery.conf fails, the cluster does not get into a weird
state, so it remains reusable later on, and the OP would not see the
problems reported after the failed promotion. I am not sure that a
warning at an early stage would actually be useful, as I doubt that any
user would notice it, but there could indeed be an argument for making
sure that recovery.conf has correct permissions set, and for failing hard
before entering recovery if that's not the case. I am not sure how much
we want to restrict things though; for example, group read access on
data folders was introduced recently...
--
Michael