Warm Standby Setup Documentation

Started by Ogden · about 16 years ago · 9 messages · general
#1 Ogden
lists@darkstatic.com

I have looked all over but could not find any detailed docs on setting up a warm standby solution using PostgreSQL 8.4. I do know of http://www.postgresql.org/docs/8.4/static/warm-standby.html but was wondering if there was a more detailed document on this topic.

Are people using this as a viable backup/hot spare solution? How has it worked out?

Thank you

Ogden

#2 Bryan Murphy
bmurphy1976@gmail.com
In reply to: Ogden (#1)
Re: Warm Standby Setup Documentation

On Mon, Mar 22, 2010 at 9:21 AM, Ogden <lists@darkstatic.com> wrote:

I have looked all over but could not find any detailed docs on setting up a
warm standby solution using PostgreSQL 8.4. I do know of
http://www.postgresql.org/docs/8.4/static/warm-standby.html but was
wondering if there was a more detailed document on this topic.

Are people using this as a viable backup/hot spare solution? How has it
worked out?

Thank you

Ogden

We use it and it works pretty well, although it's a bit of a pain to set up
the first time. We have two spares: one is an equivalently sized machine we
use for failover, and one is a smaller machine that we use for
worst-case-scenario file system snapshots/backups.

The one thing you should be aware of is that when you fail over, your spare
has no spares. I have not found a way around this problem yet. So, when
you fail over, there is a window where you have no backups while you're
building the new spares. This can be pretty nerve-wracking if your database
is like ours and it takes 3-6 hours to bring a new spare online from
scratch.

I was able to build out our solution reading the docs and asking questions
on the mailing list. The information is in the docs, you just have to read
it a few times for it to sink in.

Bryan

#3 Greg Smith
gsmith@gregsmith.com
In reply to: Bryan Murphy (#2)
Re: Warm Standby Setup Documentation

Bryan Murphy wrote:

The one thing you should be aware of is that when you fail over, your
spare has no spares. I have not found a way around this problem yet.
So, when you fail over, there is a window where you have no backups
while you're building the new spares. This can be pretty nerve
wracking if your database is like ours and it takes 3-6 hours to bring
a new spare online from scratch.

If there's another server around, you can have your archive_command on
the master ship to two systems, then use the second one as a way to
jump-start this whole process. After fail-over, just start shipping
from the new primary to that 3rd server, now the replacement standby,
and sync any files it doesn't have. Then switch it into recovery. Much
faster than doing a new base backup from the standby on larger systems.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#4 Ogden
lists@darkstatic.com
In reply to: Greg Smith (#3)
Re: Warm Standby Setup Documentation

On Mar 26, 2010, at 1:32 PM, Greg Smith wrote:

Bryan Murphy wrote:

The one thing you should be aware of is that when you fail over, your spare has no spares. I have not found a way around this problem yet. So, when you fail over, there is a window where you have no backups while you're building the new spares. This can be pretty nerve wracking if your database is like ours and it takes 3-6 hours to bring a new spare online from scratch.

If there's another server around, you can have your archive_command on the master ship to two systems, then use the second one as a way to jump-start this whole process. After fail-over, just start shipping from the new primary to that 3rd server, now the replacement standby, and sync any files it doesn't have. Then switch it into recovery. Much faster than doing a new base backup from the standby on larger systems.

How is it possible to use the archive_command to ship to different ones?

archive_command = 'rsync -a %p postgres@192.168.x.x:/usr/local/pgsql/walfiles/%f </dev/null'
archive_timeout = 120 # force a logfile segment switch after this

I suppose you can put multiple commands there then?

Also, 2 minutes - is this reasonable for a heavy write database?

Thank you

Ogden

#5 Bryan Murphy
bmurphy1976@gmail.com
In reply to: Greg Smith (#3)
Re: Warm Standby Setup Documentation

On Fri, Mar 26, 2010 at 1:32 PM, Greg Smith <greg@2ndquadrant.com> wrote:

If there's another server around, you can have your archive_command on the
master ship to two systems, then use the second one as a way to jump-start
this whole process. After fail-over, just start shipping from the new
primary to that 3rd server, now the replacement standby, and sync any files
it doesn't have. Then switch it into recovery. Much faster than doing a
new base backup from the standby on larger systems.

Every time I've tried to do this it's failed because the third server was
looking for log files starting with 00000006... but the secondary server
(new master) is now shipping files starting with 00000007... How do I get
the third server to switch over to the higher numbered files? That's the
part I was never able to overcome.

I'd really like to fix this, because this has literally given me nightmares.
:)

Bryan

#6 Yaroslav Tykhiy
yar@barnet.com.au
In reply to: Ogden (#4)
Re: Warm Standby Setup Documentation

On Fri, Mar 26, 2010 at 01:35:43PM -0500, Ogden wrote:

On Mar 26, 2010, at 1:32 PM, Greg Smith wrote:

Bryan Murphy wrote:

The one thing you should be aware of is that when you fail over, your spare has no spares. I have not found a way around this problem yet. So, when you fail over, there is a window where you have no backups while you're building the new spares. This can be pretty nerve wracking if your database is like ours and it takes 3-6 hours to bring a new spare online from scratch.

If there's another server around, you can have your archive_command on the master ship to two systems, then use the second one as a way to jump-start this whole process. After fail-over, just start shipping from the new primary to that 3rd server, now the replacement standby, and sync any files it doesn't have. Then switch it into recovery. Much faster than doing a new base backup from the standby on larger systems.

How is it possible to use the archive_command to ship to different ones?

archive_command = 'rsync -a %p postgres@192.168.x.x:/usr/local/pgsql/walfiles/%f </dev/null'
archive_timeout = 120 # force a logfile segment switch after this

I suppose you can put multiple commands there then?

You can always wrap as many commands as you like in a script.
However, there is a pitfall to watch out for when shipping WALs to
multiple standby servers. Namely your script has to handle failures
of individual WAL shipping targets so that a single target going down
doesn't disrupt operation of the whole cluster. Please see
http://archives.postgresql.org/pgsql-general/2009-10/msg00590.php
for discussion.

Yar

#7 Greg Smith
gsmith@gregsmith.com
In reply to: Ogden (#4)
Re: Warm Standby Setup Documentation

Ogden wrote:

How is it possible to use the archive_command to ship to different ones?

archive_command = 'rsync -a %p postgres@192.168.x.x:/usr/local/pgsql/walfiles/%f </dev/null'
archive_timeout = 120 # force a logfile segment switch after this

I suppose you can put multiple commands there then?

The examples in the manual lead one toward putting a full command line
directly into archive_command. I personally never do that; I call a
full-size script with "%f %p" and put all of the transport details into
it instead. Once you do that, you can ship that segment all over the
place if you feel like it, and add significantly better error
detection/recovery than is possible in a single line. As already
mentioned, though, you do need to make sure that you don't end up
blocking archiving on the master due to a delivery failure on an
optional node.
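
As a sketch of that pattern (the script name and location here are
illustrative, not a recommendation from this thread):

archive_command = '/var/lib/pgsql/archive_wal.sh %f %p'

where /var/lib/pgsql/archive_wal.sh might begin:

#!/bin/sh
# $1 = %f (bare WAL segment name), $2 = %p (segment path, relative
# to the data directory)
WAL_FILE="$1"
WAL_PATH="$2"
# Transport, logging and retry logic for each destination goes here.
# Exit non-zero only if a *required* destination fails, so that
# PostgreSQL retries the segment instead of marking it archived.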

Also, 2 minutes - is this reasonable for a heavy write database?

It's extremely unlikely a write-heavy database will care about the
setting of archive_timeout. That setting exists for the situation where
you sometimes go some number of minutes without generating at least 16MB
of WAL writes, and want to force a log file to ship anyway. That
shouldn't happen often on a busy server. Setting archive_timeout to a
low value is mainly a source of overhead on mostly idle systems.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#8 Greg Smith
gsmith@gregsmith.com
In reply to: Bryan Murphy (#5)
Re: Warm Standby Setup Documentation

Bryan Murphy wrote:

On Fri, Mar 26, 2010 at 1:32 PM, Greg Smith <greg@2ndquadrant.com
<mailto:greg@2ndquadrant.com>> wrote:

If there's another server around, you can have your
archive_command on the master ship to two systems, then use the
second one as a way to jump-start this whole process. After
fail-over, just start shipping from the new primary to that 3rd
server, now the replacement standby, and sync any files it doesn't
have. Then switch it into recovery. Much faster than doing a new
base backup from the standby on larger systems.

Every time I've tried to do this it's failed because the third server
was looking for log files starting with 00000006... but the secondary
server (new master) is now shipping files starting with 00000007...
How do I get the third server to switch over to the higher numbered
files? That's the part I was never able to overcome.

One thing it's easy to miss is that you have to save every incoming WAL
archive file on the standby, and sync them all over to the 3rd system
before you trigger the standby to be active. The archive_command has to
be active and shipping to the 3rd system before the server is triggered too.

You can think of any given standby server as a base backup plus some
number of WAL segments applied to it. So long as you never let a WAL
file get applied to or generated by the standby that becomes the master
without it making its way to the additional system, it should always be
possible to bring that additional server up to date without anything
being missing. The exact order of operations to ensure that in all
cases is certainly not obvious though.
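
One possible order of operations, sketched with illustrative host
names and paths (this assumes pg_standby with a trigger file, and
should be adapted and tested before use):

# 1. On the standby being promoted: make sure every WAL file it has
#    received also exists on the 3rd server (the replacement standby).
rsync -a /usr/local/pgsql/walfiles/ postgres@standby3:/usr/local/pgsql/walfiles/

# 2. Point archive_command on the soon-to-be master at the 3rd server,
#    so segments generated after promotion keep flowing there.

# 3. Only then create the trigger file that pg_standby on this machine
#    is watching, ending recovery and promoting it to master.
touch /tmp/pgsql.trigger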

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#9 Ogden
lists@darkstatic.com
In reply to: Yaroslav Tykhiy (#6)
Re: Warm Standby Setup Documentation

On Mar 28, 2010, at 7:45 PM, Yar Tykhiy wrote:

On Fri, Mar 26, 2010 at 01:35:43PM -0500, Ogden wrote:

On Mar 26, 2010, at 1:32 PM, Greg Smith wrote:

Bryan Murphy wrote:

The one thing you should be aware of is that when you fail over, your spare has no spares. I have not found a way around this problem yet. So, when you fail over, there is a window where you have no backups while you're building the new spares. This can be pretty nerve wracking if your database is like ours and it takes 3-6 hours to bring a new spare online from scratch.

If there's another server around, you can have your archive_command on the master ship to two systems, then use the second one as a way to jump-start this whole process. After fail-over, just start shipping from the new primary to that 3rd server, now the replacement standby, and sync any files it doesn't have. Then switch it into recovery. Much faster than doing a new base backup from the standby on larger systems.

How is it possible to use the archive_command to ship to different ones?

archive_command = 'rsync -a %p postgres@192.168.x.x:/usr/local/pgsql/walfiles/%f </dev/null'
archive_timeout = 120 # force a logfile segment switch after this

I suppose you can put multiple commands there then?

You can always wrap as many commands as you like in a script.
However, there is a pitfall to watch out for when shipping WALs to
multiple standby servers. Namely your script has to handle failures
of individual WAL shipping targets so that a single target going down
doesn't disrupt operation of the whole cluster. Please see
http://archives.postgresql.org/pgsql-general/2009-10/msg00590.php
for discussion.

Is it as simple as doing this:

archive_command = '/var/lib/pgsql/data/warm_standby.sh %p %f </dev/null'

Where /var/lib/pgsql/data/warm_standby.sh is:

#!/bin/sh

rsync -a $1 postgres@192.168.1.26:/usr/local/pgsql/walfiles/$2
rsync -a $1 postgres@192.168.1.27:/usr/local/pgsql/walfiles/$2
...

For each warm standby "slave"?

Is it safe to do it this way? I wish there were some scripts out there that I can see as examples.

Thank you

Ogden
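
One caveat with the script as sketched: a /bin/sh script exits with
the status of its last command, so if the first rsync fails and the
last one succeeds, PostgreSQL will consider the segment safely
archived and may recycle it. A more defensive sketch (addresses and
paths as in the question, otherwise illustrative):

#!/bin/sh
# $1 = %p (path to the WAL segment), $2 = %f (segment file name)
status=0

# Treat both standbys as required: remember any failure so that the
# master retries this segment rather than recycling it.
rsync -a "$1" postgres@192.168.1.26:/usr/local/pgsql/walfiles/"$2" || status=1
rsync -a "$1" postgres@192.168.1.27:/usr/local/pgsql/walfiles/"$2" || status=1

# Note: if one standby stays down, this blocks archiving on the
# master (the pitfall discussed earlier in the thread); demote a
# target to "optional" by logging the failure instead of setting
# status=1.
exit $status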