Failback with log shipping
At PGCon, several people asked me about restarting an old master as a
standby after failover has happened. And it wasn't the first time people
have asked me about it, even before 9.0. We have no mention of that in the
docs, which is a pretty serious oversight. What can we say about it?
I believe the current official policy is that you have to take a new
base backup and restore from that. Rsync can be used to speed that up.
However, someone once asked me for a comment on a script he wrote to do
that in a smarter way. I forget who and when and how exactly it worked,
but it seems possible to do safely.
First of all, you have to shut down the master cleanly for this to work,
otherwise there can be changes in the old master that never made it to
the standby.
Assuming controlled shutdown and that the standby received all WAL from
the old master before it was promoted, I think you can simply create a
recovery.conf in the old master's data directory to turn it into a
standby server, and restart. Am I missing something?
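A minimal sketch of that step in Python, assuming a 9.0-style recovery.conf with standby_mode and primary_conninfo. The function name, the data directory and the connection string are hypothetical, and the sketch does not check that the old master was shut down cleanly or that the standby received all WAL:

```python
from pathlib import Path


def write_standby_recovery_conf(data_dir, primary_conninfo, restore_command=None):
    """Write a recovery.conf that makes the server start as a standby.

    data_dir is the old master's data directory; primary_conninfo points at
    the new master. Both are supplied by the caller (hypothetical values).
    """
    def quote(value):
        # Keep the sketch simple: refuse values that would need escaping.
        if "'" in value:
            raise ValueError("single quotes are not handled in this sketch")
        return "'" + value + "'"

    lines = [
        "standby_mode = 'on'",
        "primary_conninfo = " + quote(primary_conninfo),
    ]
    if restore_command:
        # Optional: fetch WAL from an archive as well as by streaming.
        lines.append("restore_command = " + quote(restore_command))

    path = Path(data_dir) / "recovery.conf"
    path.write_text("\n".join(lines) + "\n")
    return path
```

It would be called with something like write_standby_recovery_conf("/path/to/old/master/data", "host=newmaster port=5432") after the clean shutdown and before the restart.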
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Assuming controlled shutdown and that the standby received all WAL from the
old master before it was promoted, I think you can simply create a
recovery.conf in the old master's data directory to turn it into a standby
server, and restart. Am I missing something?
Would that mean that a controlled restart of the old master so that the
recovery stops before applying the logs that were not shipped to the
slave would put it in the same situation?
How easy is it to script that? It seems all we need is the current XID
of the slave at the end of recovery. It should be in the log, maybe it's
easy enough to expose it at the SQL level…
Regards,
--
dim
On 28/05/10 16:11, Dimitri Fontaine wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Assuming controlled shutdown and that the standby received all WAL from the
old master before it was promoted, I think you can simply create a
recovery.conf in the old master's data directory to turn it into a standby
server, and restart. Am I missing something?
Would that mean that a controlled restart of the old master so that the
recovery stops before applying the logs that were not shipped to the
slave would put it in the same situation?
Not shipped before the first failover you mean? No, if any WAL records
were created in the old master that were not shipped to the standby
before failover, the corresponding changes to the data files might've
been flushed to disk already, and you can't undo those by not replaying
the WAL record on restart.
How easy is it to script that? It seems all we need is the current XID
of the slave at the end of recovery. It should be in the log, maybe it's
easy enough to expose it at the SQL level…
XID doesn't help at all, LSN more likely, but I feel that I don't fully
understand what you're saying.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Not shipped before the first failover you mean? No, if any WAL records were
created in the old master that were not shipped to the standby before
failover, the corresponding changes to the data files might've been flushed
to disk already, and you can't undo those by not replaying the WAL record on
restart.
Ah yes you need to fail between when (WAL is written and not sent) and
CHECKPOINT for this to be possible. But automatic testing of the
situation (is the data already safe in PGDATA) might still be possible?
How easy is it to script that? It seems all we need is the current XID
of the slave at the end of recovery. It should be in the log, maybe it's
easy enough to expose it at the SQL level…
XID doesn't help at all, LSN more likely, but I feel that I don't fully
understand what you're saying.
Sorry I was unclear, I was thinking in terms of recovery.conf file and
either recovery_target_xid or recovery_target_time. The idea being that
if the old-master didn't CHECKPOINT the changes that the slave missed,
then we can do crash recovery and choose to stop before that point, then
apply WALs from the new master.
That might sound like a strange thing to do, but if switching from
master to slave allows skipping the base backup to get a slave again, I
guess we'll see people choosing fully automated failover scripting
(with heartbeat and so on). The goal would be to reduce downtime as
much as you can.
When possible I'd still choose manual failover to the slave after a
master's restart and crash recovery, but the downtime constraint might
not allow that everywhere.
So you're saying controlled failover could possibly skip base backup to
reuse old master as new slave, and I'm asking if by some luck (crash
happened before CHECKPOINT) and some recovery.conf setup we could get to
the same situation in case of hard failure. That would allow completely
automatic switchover / failover with no need to resync.
I'm not sure how much clearer I managed to be :)
Regards,
--
dim
On 28/05/10 22:20, Dimitri Fontaine wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Not shipped before the first failover you mean? No, if any WAL records were
created in the old master that were not shipped to the standby before
failover, the corresponding changes to the data files might've been flushed
to disk already, and you can't undo those by not replaying the WAL record on
restart.
Ah yes you need to fail between when (WAL is written and not sent) and
CHECKPOINT for this to be possible.
A checkpoint only guarantees that everything before it has been flushed to
disk. It doesn't guarantee that nothing is flushed to disk before the next
checkpoint.
If there's a checkpoint that hasn't been shipped to the standby, you're
certainly hosed, but if there is no checkpoint you don't know if the
data files have changed or not.
But automatic testing of the
situation (is the data already safe in PGDATA) might still be possible?
Hmm, so the situation is this:
          D - E - crash!
         /
A - B - C
         \
          d - f - g - h
The letters represent WAL records. C is the last WAL record that was
shipped to the standby, D & E are WAL records that were generated in the
old master before the crash but never sent to the standby, and d-h are
WAL records created in the standby after failover.
I guess you could read the WAL in the old master and compare it with the
WAL from the standby to figure out where the failover happened (C), and
then scan all the data pages involved in records D - E, checking that
the LSNs on the data pages touched by those records are earlier than C.
That's a bit laborious, and requires knowledge of all different kinds of
WAL records to figure out which data pages they touch, but seems
possible in theory.
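To make the idea concrete, here is a toy sketch in Python. It is not a real WAL parser: WalRecord, the integer LSNs, the page identifiers and the page_lsn mapping are all hypothetical stand-ins, and a real tool would have to decode every WAL record type to learn which pages it touches.

```python
from typing import NamedTuple


class WalRecord(NamedTuple):
    lsn: int      # position of the record in the WAL stream (toy integer)
    body: bytes   # stand-in for the record contents
    pages: tuple  # ids of the data pages the record modifies (hypothetical)


def find_fork(old_wal, new_wal):
    """Return the last record common to both histories (C), or None."""
    last_common = None
    for old_rec, new_rec in zip(old_wal, new_wal):
        if old_rec != new_rec:
            break
        last_common = old_rec
    return last_common


def old_master_is_reusable(old_wal, new_wal, page_lsn):
    """Check that no page touched by the unshipped records reached disk.

    page_lsn maps a page id to the LSN stamped on that page in the old
    master's data files.
    """
    fork = find_fork(old_wal, new_wal)
    if fork is None:
        return False  # no common history at all
    for rec in old_wal:
        if rec.lsn <= fork.lsn:
            continue  # shipped before failover, nothing to check
        for page in rec.pages:
            if page_lsn[page] > fork.lsn:
                return False  # a change from D - E was flushed to disk
    return True
```

With the diagram above, find_fork returns C, and the check passes only if every page touched by D and E still carries an LSN at or before C.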
How easy is it to script that? It seems all we need is the current XID
of the slave at the end of recovery. It should be in the log, maybe it's
easy enough to expose it at the SQL level…
XID doesn't help at all, LSN more likely, but I feel that I don't fully
understand what you're saying.
Sorry I was unclear, I was thinking in terms of recovery.conf file and
either recovery_target_xid or recovery_target_time. The idea being that
if the old-master didn't CHECKPOINT the changes that the slave missed,
then we can do crash recovery and choose to stop before that point, then
apply WALs from the new master.
Ah, I see. No, you don't want to use a recovery target, that would end
the recovery and start the server. You just need to make sure to use
WALs from the new master instead of the old one when both exist.
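Purely to illustrate that preference order, a small Python sketch follows. The function name and both directory arguments are hypothetical, and it says nothing about how the server itself chooses between archived and local WAL:

```python
from pathlib import Path


def pick_wal_segment(segment_name, new_master_archive, old_master_xlog):
    """Return the path of the WAL segment to replay, or None if absent.

    The new master's copy wins whenever both directories hold a file
    with the same name.
    """
    for directory in (new_master_archive, old_master_xlog):
        candidate = Path(directory) / segment_name
        if candidate.is_file():
            return candidate
    return None
```

A restore script built around this would fall back to the old master's own segment only when the new master has no file of that name.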
So you're saying controlled failover could possibly skip base backup to
reuse old master as new slave, and I'm asking if by some luck (crash
happened before CHECKPOINT) and some recovery.conf setup we could get to
the same situation in case of hard failure. That would allow completely
automatic switchover / failover with no need to resync.
Yeah, that would be nice. In practice, I think you would get lucky more
often than not, because whenever you modify and dirty a page, writing a
WAL record, the usage count on the buffer is incremented and it won't be
evicted from the buffer cache for a while.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, May 28, 2010 at 7:58 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
At PGCon, several people asked me about restarting an old master as a
standby after failover has happened. And it wasn't the first time people have
asked me about it, even before 9.0. We have no mention of that in the docs,
which is a pretty serious oversight. What can we say about it?
I believe the current official policy is that you have to take a new base
backup and restore from that. Rsync can be used to speed that up.
However, someone once asked me for a comment on a script he wrote to do that
in a smarter way. I forget who and when and how exactly it worked, but it
seems possible to do safely.
First of all, you have to shut down the master cleanly for this to work,
otherwise there can be changes in the old master that never made it to the
standby.
Assuming controlled shutdown and that the standby received all WAL from the
old master before it was promoted, I think you can simply create a
recovery.conf in the old master's data directory to turn it into a standby
server, and restart. Am I missing something?
Failover always increments the timeline ID of the old standby (i.e., the
new master).
Can that procedure cope with the resulting timeline ID mismatch between the
servers?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center