monitoring warm standby lag in 8.4?
Hi all,
I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1]http://wiki.postgresql.org/wiki/Warm_Standby has a link[2]http://www.kennygorman.com/wordpress/?p=249 to a script which
parses the output of pg_controldata, looking for a line like this:
Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST
But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?
I haven't come across any other monitoring suggestions for warm
standby on 8.4. I've seen suggestions for hot standby slaves to use:
SELECT pg_last_xlog_receive_location();
but this won't work on an 8.4 warm standby of course. I've searched
around and haven't found[3]I was hoping this page would have some relevant info: http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but it's down now :( any other tips on how to monitor my
standby.
The manual mentions[4]http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD using pg_xlogfile_name_offset() in the context
of implementing record-based log shipping. Would this be useful for
monitoring standby lag? Any other ideas?
Thanks,
Josh
--
[1]: http://wiki.postgresql.org/wiki/Warm_Standby
[2]: http://www.kennygorman.com/wordpress/?p=249
[3]: I was hoping this page would have some relevant info: http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but it's down now :(
http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
it's down now :(
[4]: http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160
I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1] has a link[2] to a script which
parses the output of pg_controldata, looking for a line like this:Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST
But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?
Correct. But since we cannot connect to a database in recovery mode,
there are very few options to determine how far 'behind' it is. The
pg_controldata is what the check_postgres program uses. This offers a
rough check which is usually sufficient unless you have a very
inactive database or need very fine grained checking.
A better system would perhaps connect to both ends and examine which
specific WALs were being shipped and which one was last played, but
there are no tools I know of that do that. I suspect the reason for
this is that the pg_controldata check is "good enough". Certainly,
that's what we are using for many clients via check_postgres, and
it's been very good at detecting when the replica has problems. Good
enough that I've never worried about writing a different method,
anyway. :)
- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201012101126
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----
iEYEAREDAAYFAk0CVN4ACgkQvJuQZxSWSshqIgCZASW1Tsf+8/Mk2qYPIzYUoYZz
+CwAmwQbwzv8ED1QRskG8DavSr89NG/d
=qwtc
-----END PGP SIGNATURE-----
Yeah, my website is busted. I'll fix it for you.
Show quoted text
On Thu, Dec 9, 2010 at 2:30 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
Hi all,
I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1] has a link[2] to a script which
parses the output of pg_controldata, looking for a line like this:Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST
But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?I haven't come across any other monitoring suggestions for warm
standby on 8.4. I've seen suggestions for hot standby slaves to use:
SELECT pg_last_xlog_receive_location();
but this won't work on an 8.4 warm standby of course. I've searched
around and haven't found[3] any other tips on how to monitor my
standby.The manual mentions[4] using pg_xlogfile_name_offset() in the context
of implementing record-based log shipping. Would this be useful for
monitoring standby lag? Any other ideas?Thanks,
Josh--
[1] http://wiki.postgresql.org/wiki/Warm_Standby
[2] http://www.kennygorman.com/wordpress/?p=249
[3] I was hoping this page would have some relevant info:
http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
it's down now :(
[4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Fri, Dec 10, 2010 at 11:27 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
Correct. But since we cannot connect to a database in recovery mode,
there are very few options to determine how far 'behind' it is. The
pg_controldata is what the check_postgres program uses. This offers a
rough check which is usually sufficient unless you have a very
inactive database or need very fine grained checking.A better system would perhaps connect to both ends and examine which
specific WALs were being shipped and which one was last played, but
there are no tools I know of that do that. I suspect the reason for
this is that the pg_controldata check is "good enough". Certainly,
that's what we are using for many clients via check_postgres, and
it's been very good at detecting when the replica has problems. Good
enough that I've never worried about writing a different method,
anyway. :)
Thanks for the reply.
One simple piece I added in to my monitoring script which wasn't here:
http://www.kennygorman.com/wordpress/?p=249
(or in check_postgres.pl, from a quick look at check_checkpoint() in
check_postgres.pl) is a verification that the standby slave is
actually 'in archive recovery' mode, from looking at the 'Database
cluster state:' output of pg_controldata.
I was mulling over some ways to add in a reasonable check that the
standby was keeping up with the WAL stream. Comparing WAL file names
on master vs. standby would probably work, but I was also thinking
that a simple directory-size check on the standby's WAL archive
directory would show whether we were receiving WAL files faster than
we could process them.
Josh