monitoring warm standby lag in 8.4?

Started by Josh Kupershmidtover 15 years ago4 messagesgeneral
Jump to latest
#1Josh Kupershmidt
schmiddy@gmail.com

Hi all,

I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1]http://wiki.postgresql.org/wiki/Warm_Standby has a link[2]http://www.kennygorman.com/wordpress/?p=249 to a script which
parses the output of pg_controldata, looking for a line like this:

Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST

But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?

I haven't come across any other monitoring suggestions for warm
standby on 8.4. I've seen suggestions for hot standby slaves to use:
SELECT pg_last_xlog_receive_location();
but this won't work on an 8.4 warm standby of course. I've searched
around and haven't found[3]I was hoping this page would have some relevant info: http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but it's down now :( any other tips on how to monitor my
standby.

The manual mentions[4]http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD using pg_xlogfile_name_offset() in the context
of implementing record-based log shipping. Would this be useful for
monitoring standby lag? Any other ideas?

Thanks,
Josh

--
[1]: http://wiki.postgresql.org/wiki/Warm_Standby
[2]: http://www.kennygorman.com/wordpress/?p=249
[3]: I was hoping this page would have some relevant info: http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but it's down now :(
http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
it's down now :(
[4]: http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD

#2Greg Sabino Mullane
greg@turnstep.com
In reply to: Josh Kupershmidt (#1)
Re: monitoring warm standby lag in 8.4?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1] has a link[2] to a script which
parses the output of pg_controldata, looking for a line like this:

Time of latest checkpoint: Thu 09 Dec 2010 01:35:46 PM EST

But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?

Correct. But since we cannot connect to a database in recovery mode,
there are very few options to determine how far 'behind' it is. The
pg_controldata is what the check_postgres program uses. This offers a
rough check which is usually sufficient unless you have a very
inactive database or need very fine grained checking.

A better system would perhaps connect to both ends and examine which
specific WALs were being shipped and which one was last played, but
there are no tools I know of that do that. I suspect the reason for
this is that the pg_controldata check is "good enough". Certainly,
that's what we are using for many clients via check_postgres, and
it's been very good at detecting when the replica has problems. Good
enough that I've never worried about writing a different method,
anyway. :)

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201012101126
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk0CVN4ACgkQvJuQZxSWSshqIgCZASW1Tsf+8/Mk2qYPIzYUoYZz
+CwAmwQbwzv8ED1QRskG8DavSr89NG/d
=qwtc
-----END PGP SIGNATURE-----

#3Scott Mead
scott@scottrmead.com
In reply to: Josh Kupershmidt (#1)
Re: monitoring warm standby lag in 8.4?

Yeah, my website is busted. I'll fix it for you.

Show quoted text

On Thu, Dec 9, 2010 at 2:30 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:

Hi all,

I'm wondering if there's an accepted way to monitor a warm standby
machine's lag in 8.4. The wiki[1] has a link[2] to a script which
parses the output of pg_controldata, looking for a line like this:

 Time of latest checkpoint:            Thu 09 Dec 2010 01:35:46 PM EST

But I'm not sure whether this timestamp is to be trusted as an
indicator of how far behind the standby is in its recovery -- this
timestamp just tells us when the standby last performed a checkpoint,
regardless of how far behind in the WAL stream it is, right?

I haven't come across any other monitoring suggestions for warm
standby on 8.4. I've seen suggestions for hot standby slaves to use:
 SELECT pg_last_xlog_receive_location();
but this won't work on an 8.4 warm standby of course. I've searched
around and haven't found[3] any other tips on how to monitor my
standby.

The manual mentions[4] using pg_xlogfile_name_offset() in the context
of implementing record-based log shipping. Would this be useful for
monitoring standby lag? Any other ideas?

Thanks,
Josh

--
[1] http://wiki.postgresql.org/wiki/Warm_Standby
[2] http://www.kennygorman.com/wordpress/?p=249
[3] I was hoping this page would have some relevant info:
http://www.scottrmead.com/blogs/scott/warm-standby-monitoring , but
it's down now :(
[4] http://www.postgresql.org/docs/8.4/static/warm-standby.html#WARM-STANDBY-RECORD

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Josh Kupershmidt
schmiddy@gmail.com
In reply to: Greg Sabino Mullane (#2)
Re: monitoring warm standby lag in 8.4?

On Fri, Dec 10, 2010 at 11:27 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:

Correct. But since we cannot connect to a database in recovery mode,
there are very few options to determine how far 'behind' it is. The
pg_controldata is what the check_postgres program uses. This offers a
rough check which is usually sufficient unless you have a very
inactive database or need very fine grained checking.

A better system would perhaps connect to both ends and examine which
specific WALs were being shipped and which one was last played, but
there are no tools I know of that do that. I suspect the reason for
this is that the pg_controldata check is "good enough". Certainly,
that's what we are using for many clients via check_postgres, and
it's been very good at detecting when the replica has problems. Good
enough that I've never worried about writing a different method,
anyway. :)

Thanks for the reply.

One simple piece I added in to my monitoring script which wasn't here:
http://www.kennygorman.com/wordpress/?p=249
(or in check_postgres.pl, from a quick look at check_checkpoint() in
check_postgres.pl) is a verification that the standby slave is
actually 'in archive recovery' mode, from looking at the 'Database
cluster state:' output of pg_controldata.

I was mulling over some ways to add in a reasonable check that the
standby was keeping up with the WAL stream. Comparing WAL file names
on master vs. standby would probably work, but I was also thinking
that a simple directory-size check on the standby's WAL archive
directory would show whether we were receiving WAL files faster than
we could process them.

Josh