Why is restore_command called for existing files in pg_xlog?
Hello hackers,
There is one strange and awful thing I don't understand about
restore_command: it is always called for every single WAL segment
postgres wants to apply (even if such a segment already exists in pg_xlog)
until the replica starts streaming from the master.
If there is no restore_command in the recovery.conf, it works perfectly,
i.e. postgres replays the existing WAL segments and at some point connects to
the master and starts streaming from it.
When a restore_command is there, starting a replica can become a real
problem, especially if the restore_command is slow.
Is it possible to change this behavior somehow? First look into pg_xlog,
and only if the file is missing or "corrupted", call restore_command.
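In the meantime I am thinking of working around it by pointing
restore_command at a small wrapper that checks pg_xlog first and only then
goes to the archive. A rough sketch (the wrapper name and the archive fetch
command are only placeholders, and a simple existence check obviously does
not catch a corrupted segment):

    #!/usr/bin/env python
    # restore_wal.py <wal_filename> <destination_path>
    # Rough sketch of a restore_command wrapper. postgres runs restore_command
    # with the data directory as the current working directory, so a relative
    # pg_xlog path works here.
    import os
    import shutil
    import subprocess
    import sys

    def main(fname, dest):
        local = os.path.join('pg_xlog', fname)
        if os.path.isfile(local):
            # The segment is already in pg_xlog: just hand it back to postgres.
            shutil.copy(local, dest)
            return 0
        # Otherwise fall back to the real archive fetch (placeholder command).
        return subprocess.call(['/usr/local/bin/fetch_from_archive', fname, dest])

    if __name__ == '__main__':
        sys.exit(main(sys.argv[1], sys.argv[2]))

and in recovery.conf:

    restore_command = '/path/to/restore_wal.py %f %p'

But this is only a workaround; it would be much nicer if postgres did it by
itself.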
Regards,
---
Alexander Kukushkin
On Fri, Jun 2, 2017, at 11:51 AM, Alexander Kukushkin wrote:
Hello hackers,
There is one strange and awful thing I don't understand about
restore_command: it is always called for every single WAL
segment postgres wants to apply (even if such a segment already exists
in pg_xlog) until the replica starts streaming from the master.
The real problem this question is related to is being unable to bring a
former master, demoted after a crash, online, since the WAL segments
required to get it to a consistent state were not archived while it
was still a master, and local segments in pg_xlog are ignored when a
restore_command is defined. The other replicas wouldn't be good
candidates for promotion either, as they were way behind the master
(because the last N WAL segments were not archived and streaming
replication had a delay of a few seconds).
Is this the correct list for such questions, or would it be more
appropriate to ask elsewhere (e.g. pgsql-bugs)?
If there is no restore_command in the recovery.conf, it works perfectly,
i.e. postgres replays the existing WAL segments and at some point
connects to the master and starts streaming from it.
When a restore_command is there, starting a replica can become a real
problem, especially if the restore_command is slow.
Is it possible to change this behavior somehow? First look into
pg_xlog, and only if the file is missing or "corrupted", call
restore_command.
Regards,
---
Alexander Kukushkin
Sincerely,
Alex
On Mon, Jun 12, 2017 at 5:25 AM, Alex Kliukin <alexk@hintbits.com> wrote:
On Fri, Jun 2, 2017, at 11:51 AM, Alexander Kukushkin wrote:
Hello hackers,
There is one strange and awful thing I don't understand about
restore_command: it is always called for every single WAL segment
postgres wants to apply (even if such a segment already exists in pg_xlog)
until the replica starts streaming from the master.
The real problem this question is related to is being unable to bring a
former master, demoted after a crash, online, since the WAL segments
required to get it to a consistent state were not archived while it was
still a master, and local segments in pg_xlog are ignored when a
restore_command is defined. The other replicas wouldn't be good candidates
for promotion either, as they were way behind the master (because the last
N WAL segments were not archived and streaming replication had a delay of a
few seconds).
I don't really understand the problem. If the other replicas are not
candidates for promotion, then why was the master ever "demoted" in the
first place? It should just go through normal crash recovery, not PITR
recovery, and therefore will read the files from pg_xlog just fine.
If you already promoted one of the replicas and accepted data changes into
it, and now are thinking that was not a good idea, then there is no
off-the-shelf automatic way to merge the two systems together. You have to
do a manual inspection of the differences. To do that, you would start by
starting up (a copy of) the crashed master, using normal crash recovery,
not PITR.
Is this the correct list for such questions, or would it be more appropriate
to ask elsewhere (e.g. pgsql-bugs)?
Probably more appropriate for pgsql-general or pgsql-admin.
Cheers,
Jeff
Hi Jeff,
On Mon, Jun 12, 2017, at 06:42 PM, Jeff Janes wrote:
On Mon, Jun 12, 2017 at 5:25 AM, Alex Kliukin
<alexk@hintbits.com> wrote:
On Fri, Jun 2, 2017, at 11:51 AM, Alexander Kukushkin wrote:
Hello hackers,
There is one strange and awful thing I don't understand about
restore_command: it is always called for every single WAL
segment postgres wants to apply (even if such a segment already exists
in pg_xlog) until the replica starts streaming from the master.
The real problem this question is related to is being unable to bring
a former master, demoted after a crash, online, since the WAL
segments required to get it to a consistent state were not archived
while it was still a master, and local segments in pg_xlog are
ignored when a restore_command is defined. The other replicas
wouldn't be good candidates for promotion either, as they were way
behind the master (because the last N WAL segments were not archived
and streaming replication had a delay of a few seconds).
I don't really understand the problem. If the other replicas are not
candidates for promotion, then why was the master ever "demoted" in
the first place? It should just go through normal crash recovery,
not PITR recovery, and therefore will read the files from pg_xlog
just fine.
We run an automatic failover daemon, called "Patroni", that uses a
consistency layer (RAFT, implemented by Etcd) in order to decide on
which node should be the leader. In Patroni, only the node that holds the
leader key in Etcd is allowed to become a master. When Patroni detects
that PostgreSQL on the node that holds the leader lock is not
running, it starts the instance in "read-only" mode by writing a
recovery.conf without the "primary_conninfo". Once the former master
running as read-only recovers to a consistent state and is not behind
the last known master's position, it is promoted back, unless a replica
takes over the master lock.
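To illustrate, such a recovery.conf looks roughly like this (simplified;
the restore_command value here is only a placeholder):

    standby_mode = 'on'
    restore_command = '/usr/local/bin/fetch_from_archive %f %p'
    # no primary_conninfo, so the node never starts streaming and stays read-only

Without primary_conninfo the node can only replay whatever WAL the
restore_command manages to fetch, which is exactly where the behavior
described above hurts.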
The reason we cannot just start the crashed master normally is a
possible split-brain scenario. If during the former master's crash
recovery another replica takes over the lock because it is close enough
to the last known master position and is deemed "healthy" to promote,
the former master starts as a master nevertheless (we have no control
over the PostgreSQL crash recovery process), despite the fact that it
has no lock, violating the rule of "first get the lock, then promote".
If you already promoted one of the replicas and accepted data changes
into it, and now are thinking that was not a good idea, then there is
no off-the-shelf automatic way to merge the two systems together. You
have to do a manual inspection of the differences. To do that, you would
start by starting up (a copy of) the crashed master, using normal
crash recovery, not PITR.
In our scenario, no replica is promoted. The master starts in read-only
mode, and is stuck there forever, since it cannot restore the WAL
segments stored in its own WAL directory, and those segments were never
archived. The replicas cannot be promoted, because they are too far
behind the master.
I don't really see any reason not to try to restore WAL segments from
the WAL directory first. It would speed up the recovery in many cases:
since the segments are already there, there is no need to fetch them
from the archive.
Probably more appropriate for pgsql-general or pgsql-admin.
Thanks!
Sincerely,
Alex