Hot Standby Failover Scenario
Hi everybody.
I have a question about hot standby. First, let me describe my Postgres
master-slave scenario.
1. Master A and Slave B are in different locations, say different regions
of the country.
2. Master A and Slave B are configured for hot standby as described in the
documentation.
3. When Master A fails, the database is failed over to Slave B by creating
the trigger file.
4. As soon as Slave B becomes a standalone pg server, run pg_start_backup(),
so that all transactions will only be recorded in the WAL files.
5. Applications are switched to standalone B until Server A's recovery is
done.
6. When Server A has recovered (but is still offline), run pg_stop_backup()
and copy all WAL files from B to A.
7. Once the WAL files are copied to A, set A's configuration back to Master
and B's to Slave again (for B, rename recovery.done to recovery.conf and
remove the trigger file).
8. Bring up A, restart B, and switch all applications back to A.
I've tried this procedure with no luck. Before the simulated failure, A had
10 million records and B had 10 million records too. Then I failed over to
B manually, simulating a failure of A. I ran pg_start_backup() and inserted
a bunch of data, so that, let's say, A still had 10 million records and B
had 20 million. Then I copied the WAL files from B to A, hoping that when A
came up again it would catch up with B, giving 20 million records on both A
and B, and that hot-standby streaming would resume. But the experiment
failed.
I've checked the logs and found a timeline mismatch. Slave B's log says
that the timeline of the primary server (Master A) does not match the
target timeline of the standby server. Can anyone suggest what to do in
this case? Any suggestions will be greatly appreciated. Thank you.
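
As a rough aid for diagnosing the mismatch described above: the timeline a
WAL segment belongs to is encoded in the first 8 hex digits of its
24-character file name. The sketch below is in Python, which is not used
elsewhere in this thread; the function name and the sample file names are
made up for illustration.

```python
def wal_timeline(segment_name: str) -> int:
    """Return the timeline ID encoded in a WAL segment file name."""
    # Segment names are 24 hex digits; the first 8 are the timeline ID.
    if len(segment_name) != 24:
        raise ValueError(f"not a WAL segment name: {segment_name!r}")
    return int(segment_name[:8], 16)


# Hypothetical file names: one written before the failover, one after.
before = "000000010000000000000005"
after = "000000020000000000000006"
print(wal_timeline(before), wal_timeline(after))
```

With these sample names the script prints 1 and 2, meaning the second
segment was written on a different timeline than the first.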
On 02/27/2012 10:05 PM, Lucky Haryadi wrote:
> 3. When Master A fails, the database is failed over to Slave B by
> creating the trigger file.
As soon as you trigger a standby, it switches to a new timeline. At
that point, the series of WAL files diverges. It's no longer possible
to apply them to a system that is still on the original timeline, such
as your original master A in this situation. There's a good reason for
that. Let's say that A committed an additional transaction before it
went down, but that commit wasn't replicated to B. You can't just move
records from B over anymore in that case. The only way to make sure A
is in sync again is to do a new base backup, which you can potentially
accelerate using rsync to only copy what has changed. I see a lot of
people try to bypass one of the steps recommended in the manual using
various schemes like yours, and they usually have a bug like this in
there--sometimes obvious like this, sometimes subtle. Trying to get too
clever here is dangerous to your database.
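
The divergence argument above can be pictured with a toy model. This is
only a sketch in Python under simplifying assumptions: the transaction
labels are invented, and real WAL is a stream of records identified by
position and timeline, not a list of strings. It shows that once A holds a
commit that B never received, A's history is no longer a prefix of B's, so
replaying B's records on top of A cannot produce a consistent copy.

```python
def is_prefix(shorter: list[str], longer: list[str]) -> bool:
    """True if `longer` starts with every record of `shorter`, in order."""
    return longer[:len(shorter)] == shorter


# Records both servers had applied when replication last caught up.
shared = ["tx1", "tx2", "tx3"]

# A committed one more transaction that never reached B, then went down.
history_a = shared + ["tx4 (committed on A only)"]

# B was triggered, moved to a new timeline, and kept accepting writes.
history_b = shared + ["tx5 (B, new timeline)", "tx6 (B, new timeline)"]

# B's records only extend A cleanly if A's history is a prefix of B's.
print(is_prefix(shared, history_b))     # True
print(is_prefix(history_a, history_b))  # False
```

In this model a fresh base backup corresponds to replacing history_a with a
copy of history_b rather than appending to it.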
Warning: pgsql-hackers is the mailing list for people to discuss the
development of PostgreSQL, not how to use it. Questions like this
should be asked on either the pgsql-admin or pgsql-general mailing list.
I'm not going to answer additional questions like this from you here
on this list, and I doubt anyone else will either.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com