Using old master as new replica after clean switchover

Started by Nikolay Samokhvalov, about 7 years ago (5 messages)
#1 Nikolay Samokhvalov
samokhvalov@gmail.com
1 attachment(s)

Currently, the documentation explicitly states that after failover, the old
master must be recreated from scratch, or pg_rewind should be used (which
requires wal_log_hints to be on, while it is off by default):

The former standby is now the primary, but the former primary is down and

might stay down. To return to normal operation, a standby server must be
recreated, either on the former primary system when it comes up, or on a
third, possibly new, system. The pg_rewind utility can be used to speed up
this process on large clusters.
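
As a side note, the pg_rewind prerequisite mentioned above is a single setting
that has to be in place on the primary ahead of time. A minimal, illustrative
snippet (enabling data checksums at initdb time satisfies the same requirement):

# in postgresql.conf on the primary, set before any failover happens
wal_log_hints = on    # default is off; a server restart is needed for it to take effect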

My research shows that some people already rely on the following planned
failover (aka switchover) procedure and are doing it in production (a rough
shell sketch follows the list):

1) shutdown the current master
2) ensure that the "master candidate" replica has received all WAL data
including shutdown checkpoint from the old master
3) promote the master candidate to make it new master
4) configure recovery.conf on the old master node, while it's inactive
5) start the old master node as a new replica following the new master.
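
Purely as an illustration of these steps, a rough shell sketch; host names,
paths, the replication role, and the pre-v12 recovery.conf syntax are
assumptions, and the LSN functions are spelled as in version 10+:

# 1) cleanly shut down the current master ("fast" mode; walsenders are stopped
#    last, so connected standbys receive the shutdown checkpoint)
pg_ctl -D /var/lib/postgresql/data -m fast stop

# 2) on the candidate, check that everything was received and replayed; compare
#    with "Latest checkpoint location" from pg_controldata on the old master
psql -h standby1 -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# 3) promote the candidate (run on standby1)
pg_ctl -D /var/lib/postgresql/data promote

# 4) on the old master, point recovery.conf at the new master
cat > /var/lib/postgresql/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=standby1 user=replicator'
recovery_target_timeline = 'latest'
EOF

# 5) start the old master as a new replica
pg_ctl -D /var/lib/postgresql/data start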

It looks to me now that, if no steps are missed in this procedure, the approach
is valid for Postgres versions 9.3+ (for the earliest of those versions, such as
9.3 itself, perhaps not always – people who know the details better can correct
me here). Am I right? Or am I missing some risks?

Two changes were made in 9.3 which allowed this approach in general [1] [2].
Also, I see from the code [3] that during the shutdown process, the walsenders
are the last to be stopped, which allows replicas to get the shutdown
checkpoint information.

Is this approach considered safe now?

If so, let's add it to the documentation, making it official. The patch is
attached.

Links:
[0]: 26.3 Failover: https://www.postgresql.org/docs/current/static/warm-standby-failover.html
[1]: Support clean switchover: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
[2]: Allow a streaming replication standby to follow a timeline switch: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=abfd192b1b5ba5216ac4b1f31dcd553106304b19
[3]: src/backend/replication/walsender.c: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/walsender.c;hb=HEAD#l276

Regards,
Nik

Attachments:

failover_doc.patch (application/octet-stream)
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index faf8e71854..088c51c144 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1452,7 +1452,12 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
     must be recreated,
     either on the former primary system when it comes up, or on a third,
     possibly new, system. The <xref linkend="app-pgrewind"/> utility can be
-    used to speed up this process on large clusters.
+    used to speed up this process on large clusters. However, if the old
+    master was shut down cleanly before the failover, and the replica had
+    received all WAL data, including the shutdown checkpoint record, before
+    it was promoted, the old master can be started as a new replica
+    attached to the new master without rebuilding it or using pg_rewind.
+    In this case, only configuring recovery.conf is needed.
     Once complete, the primary and standby can be
     considered to have switched roles. Some people choose to use a third
     server to provide backup for the new primary until the new standby
#2 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Nikolay Samokhvalov (#1)
Re: Using old master as new replica after clean switchover

On Thu, 25 Oct 2018 02:57:18 -0400
Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:
...

My research shows that some people already rely on the following planned
failover (aka switchover) procedure and are doing it in production:

1) shutdown the current master
2) ensure that the "master candidate" replica has received all WAL data
including shutdown checkpoint from the old master
3) promote the master candidate to make it new master
4) configure recovery.conf on the old master node, while it's inactive
5) start the old master node as a new replica following the new master.

Indeed.

It looks to me now that, if no steps are missed in this procedure, the approach
is valid for Postgres versions 9.3+ (for the earliest of those versions, such as
9.3 itself, perhaps not always – people who know the details better can correct
me here). Am I right? Or am I missing some risks?

As far as I know, this is correct.

Two changes were made in 9.3 which allowed this approach in general [1] [2].
Also, I see from the code [3] that during the shutdown process, the walsenders
are the last to be stopped, which allows replicas to get the shutdown
checkpoint information.

I had come to the same conclusions when I was studying controlled failover some
years ago to implement it in the PAF project (allowing a controlled switchover
in one command). Here is a discussion around switchover that took place three
years ago on the Pacemaker mailing list:

https://lists.clusterlabs.org/pipermail/users/2016-October/011568.html

Is this approach considered safe now?

Considering the above points, I do think so.

The only additional nice step would be to be able to run some more safety tests
AFTER the switchover process on the old master. The only way I can think of
would be to run pg_rewind, even if it doesn't do much.

if so, let's add it to the documentation, making it official. The patch is
attached.

I suppose we should add the technical steps in a sample procedure?

#3 Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#2)
Re: Using old master as new replica after clean switchover

On Thu, Oct 25, 2018 at 11:15:51AM +0200, Jehan-Guillaume de Rorthais wrote:

On Thu, 25 Oct 2018 02:57:18 -0400
Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:

My research shows that some people already rely on the following planned
failover (aka switchover) procedure and are doing it in production:

1) shutdown the current master
2) ensure that the "master candidate" replica has received all WAL data
including shutdown checkpoint from the old master
3) promote the master candidate to make it new master
4) configure recovery.conf on the old master node, while it's inactive
5) start the old master node as a new replica following the new master.

Indeed.

The important point here is that the primary will wait for the shutdown
checkpoint record to be replayed on the standbys before it finishes shutting
down.

The only additional nice step would be to be able to run some more safety tests
AFTER the switchover process on the old master. The only way I can think of
would be to run pg_rewind, even if it doesn't do much.

Do you have something specific in mind here? I am curious if you're
thinking about things like page-level checks for LSN matches under some
threshold or such, because you should not have pages on the previous
primary which have LSNs newer than the point up to which the standby has
replayed.

if so, let's add it to the documentation, making it official. The patch is
attached.

I suppose we should add the technical steps in a sample procedure?

If an addition to the docs is done, symbolizing the steps in a list
would be cleaner, with perhaps something in a dedicated section or a new
sub-section. The failover flow you are mentioning is good practice
because that's safe, and there is always room for improvements in the
docs.
--
Michael

#4 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#3)
Re: Using old master as new replica after clean switchover

On Thu, 25 Oct 2018 20:45:57 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Oct 25, 2018 at 11:15:51AM +0200, Jehan-Guillaume de Rorthais wrote:

On Thu, 25 Oct 2018 02:57:18 -0400
Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:

My research shows that some people already rely on the following planned
failover (aka switchover) procedure and are doing it in production:

1) shutdown the current master
2) ensure that the "master candidate" replica has received all WAL data
including shutdown checkpoint from the old master
3) promote the master candidate to make it new master
4) configure recovery.conf on the old master node, while it's inactive
5) start the old master node as a new replica following the new master.

Indeed.

The important point here is that the primary will wait for the shutdown
checkpoint record to be replayed on the standbys before it finishes shutting
down.

Yes. However, it gives up if the connection to the standby fails. This is
obvious, but that's why we really need to double-check on the standby that the
shutdown checkpoint has been received, just in case of some network trouble or
such.

The only additional nice step would be to be able to run some more safety
tests AFTER the switchover process on the old master. The only way I can
think of would be to run pg_rewind, even if it doesn't do much.

Do you have something specific in mind here? I am curious if you're
thinking about things like page-level checks for LSN matches under some
threshold or such, because you should not have pages on the previous
primary which have LSNs newer than the point up to which the standby has
replayed.

This could be a decent check. Heavy and slow, but safe.
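
For what it's worth, a very rough sketch of that kind of check; the table name
and block number are just examples, pageinspect is assumed to be installed
already, and a real check would have to walk every relation and block:

# capture the standby's replay position just before promoting it
psql -h standby1 -Atc "SELECT pg_last_wal_replay_lsn();"

# later, on the old master, every page LSN should be at or below that value
psql -h oldmaster -c "SELECT lsn FROM page_header(get_raw_page('pgbench_accounts', 0));"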

Other ideas I have (see below) are only related to easing the existing
procedure.

Both are interesting projects I could hopefully work on.

if so, let's add it to the documentation, making it official. The patch is
attached.

I suppose we should add the technical steps in a sample procedure?

If an addition to the docs is done, symbolizing the steps in a list
would be cleaner, with perhaps something in a dedicated section or a new
sub-section. The failover flow you are mentioning is good practice
because that's safe, and there is always room for improvements in the
docs.

The hardest part to explain here is how to check that the shutdown checkpoint
reached the standby-to-promote.
* in PAF, I'm using pg_waldump to check if the shutdown checkpoint has been
received.
* in manual operation, I force a checkpoint on the standby and compare "Latest
checkpoint's REDO location" from the controldata file with the one on the old
master.

I'm not sure how to explain either method clearly in the doc.
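
Purely as an illustration, a rough sketch of both checks; paths, host names and
the WAL segment name are assumptions:

# on the old master, after the clean shutdown, note the last checkpoint
pg_controldata /var/lib/postgresql/data | grep "Latest checkpoint's REDO location"

# method 1: on the standby, look for the shutdown checkpoint record in the last
# received WAL segment (the segment name is just an example)
pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000042 | grep CHECKPOINT_SHUTDOWN

# method 2: force a restartpoint on the standby, then compare its control data
# with the old master's
psql -h standby1 -c "CHECKPOINT;"
pg_controldata /var/lib/postgresql/data | grep "Latest checkpoint's REDO location"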

Two ideas come to mind to improve this.

What about logging the shutdown checkpoint on the old master?
On the standby side, we could cross-check it with a function confirming:
1/ the very last XLogRecord received was the old master shutdown checkpoint
2/ the received shutdown checkpoint has the same LSN

A second idea would be that an old master detects it has been started as a new
standby and only replays XLogRecords from the new master if the new timeline
forks from its previous timeline at its shutdown checkpoint?

#5 Nikolay Samokhvalov
samokhvalov@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#4)
Re: Using old master as new replica after clean switchover

On Thu, Oct 25, 2018 at 6:03 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
wrote:

What about logging the shutdown checkpoint on the old master?
On the standby side, we could cross-check it with a function confirming:
1/ the very last XLogRecord received was the old master shutdown checkpoint
2/ the received shutdown checkpoint has the same LSN

Additionally, the new instructions in the doc might include a recommendation on
what to do if we find that the shutdown checkpoint wasn't received and replayed
by the replica-to-promote. From my understanding, before promotion, we could
"manually" transfer the missing WAL data from the old, inactive master and
replay it on the replica-to-promote (of course, if restore_command is properly
configured on it). Right?
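
In case it helps, a minimal sketch of that manual transfer; host names, paths
and the segment name are assumptions, and dropping the file into the archive
that restore_command reads from would work just as well:

# on the (cleanly stopped) old master, ship the WAL the standby is missing
scp /var/lib/postgresql/data/pg_wal/000000010000000000000042 \
    standby1:/var/lib/postgresql/data/pg_wal/

# on the standby, confirm the shutdown checkpoint has now been replayed,
# then promote it
psql -h standby1 -c "SELECT pg_last_wal_replay_lsn();"
pg_ctl -D /var/lib/postgresql/data promote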

By the way, it looks to me that it might be better to write more than just a few
sentences. What if it were a new chapter – say, "Switchover" – next to
"Failover"? It would also give the reader a better understanding by explicitly
distinguishing the planned and unplanned processes of master/replica role
changes.

Regards,
Nik