warning message in standby

Started by Fujii Masaoalmost 16 years ago61 messageshackers
Jump to latest
#1Fujii Masao
masao.fujii@gmail.com

Hi,

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#2Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#1)
Re: warning message in standby

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover? Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: warning message in standby

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover? Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

What "warning message" are we talking about? All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

regards, tom lane

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: warning message in standby

On 10/06/10 17:38, Tom Lane wrote:

Robert Haas<robertmhaas@gmail.com> writes:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao<masao.fujii@gmail.com> wrote:

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover? Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

What "warning message" are we talking about? All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a
corrupt record means you've reached end of WAL. In standby mode, when
streaming WAL from master, that shouldn't happen, and it's not clear
what to do if it does. PANIC is not a good idea, at least if the server
uses hot standby, because that only makes the situation worse from
availability point of view. So we log the error as a WARNING, and keep
retrying. It's unlikely that the problem will just go away, but we keep
retrying anyway in the hope that it does. However, it seems that we're
too aggressive with the retries.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: warning message in standby

On Thu, Jun 10, 2010 at 12:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a corrupt
record means you've reached end of WAL. In standby mode, when streaming WAL
from master, that shouldn't happen, and it's not clear what to do if it
does. PANIC is not a good idea, at least if the server uses hot standby,
because that only makes the situation worse from availability point of view.
So we log the error as a WARNING, and keep retrying. It's unlikely that the
problem will just go away, but we keep retrying anyway in the hope that it
does. However, it seems that we're too aggressive with the retries.

You can reproduce this problem by doing the following.

1. initdb
2. edit postgresql.conf, set wal_level=hot_standby, max_wal_senders=1;
edit pg_hba.conf, trust local replication connections
3. pg_ctl start
4. make some changes to the database
5. take a hot backup to another directory (call it pgslave)
6. create pgslave/recovery.conf with standby_mode='on',
primary_conninfo=whatever, edit pgslave/postgresql.conf change the
port number, set hot_standby=on
7. pg_ctl start -D pgslave

At this point you should have a working HS/SR setup. Now:

8. shut the slave down
9. move recovery.conf out of the way
10. restart the slave - it will do recovery and enter normal running
11. make some database changes
12. stop the slave
13. put recovery.conf back
14. restart the slave
15. make a bunch of changes on the master

When the slave then tries to replay, you then get something like:

WARNING: invalid record length at 0/4005330
WARNING: invalid record length at 0/4005330
WARNING: invalid record length at 0/4005330

...ad infinitum.

Obviously there are other ways this could occur - the WAL could really
be corrupted, for example - but the current handling is not too
graceful. I'm actually thinking it might be better to trigger a
shutdown if this happens. Probably something has gone haywire and
manual intervention is required. Retrying when there's no hope of
success isn't really that helpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#6Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#5)
Re: warning message in standby

On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

At this point you should have a working HS/SR setup.  Now:

8. shut the slave down
9. move recovery.conf out of the way
10. restart the slave - it will do recovery and enter normal running
11. make some database changes
12. stop the slave
13. put recovery.conf back
14. restart the slave
15. make a bunch of changes on the master

When the slave then tries to replay, you then get something like:

WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330

Woah, why does this procedure lead to this situation? I would hope
there's nothing a user could do which would cause it short of invoking
dd to corrupt the WAL files.

At precisely which step of the procedure did the user do something
wrong? Is there any reason we can't detect that they've done it and
throw a specific error message saying the configuration is invalid?

--
greg

#7Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#6)
Re: warning message in standby

On Thu, Jun 10, 2010 at 12:49 PM, Greg Stark <gsstark@mit.edu> wrote:

On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

At this point you should have a working HS/SR setup.  Now:

8. shut the slave down
9. move recovery.conf out of the way
10. restart the slave - it will do recovery and enter normal running
11. make some database changes
12. stop the slave
13. put recovery.conf back
14. restart the slave
15. make a bunch of changes on the master

When the slave then tries to replay, you then get something like:

WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330

Woah, why does this procedure lead to this situation? I would hope
there's nothing a user could do which would cause it short of invoking
dd to corrupt the WAL files.

At precisely which step of the procedure did the user do something
wrong?

13.

Is there any reason we can't detect that they've done it and
throw a specific error message saying the configuration is invalid?

I'm not sure how we'd go about doing that, but I agree it would be nice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: warning message in standby

On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a corrupt
record means you've reached end of WAL. In standby mode, when streaming WAL
from master, that shouldn't happen, and it's not clear what to do if it
does. PANIC is not a good idea, at least if the server uses hot standby,
because that only makes the situation worse from availability point of view.
So we log the error as a WARNING, and keep retrying. It's unlikely that the
problem will just go away, but we keep retrying anyway in the hope that it
does. However, it seems that we're too aggressive with the retries.

Right. The attached patch calms down the retries: if we found an invalid
record while streaming WAL from master, we sleep for 5 seconds (needs to
be reduced?) before retrying to replay the record which is in the same
location where the invalid one was found. Comments?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

calm_down_retries_v1.patchapplication/octet-stream; name=calm_down_retries_v1.patchDownload+14-0
#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#2)
Re: warning message in standby

On Thu, 2010-06-10 at 09:57 -0400, Robert Haas wrote:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:

When an error is found in the WAL streamed from the master, a

warning

message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only

queries.

To fix this problem, we should add a sleep into

emode_for_corrupt_record()

or somewhere? Or we should stop walreceiver and retry to read WAL

from

pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

So you saw a bug, but didn't report it? Why was that?

I find it amazing that such an obvious problem wasn't covered in
testing, nor do we have a clear plan in the design for handling it.

How many other things are known about but not reported?

If anybody has bugs, report them now.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#8)
Re: warning message in standby

On 11/06/10 07:18, Fujii Masao wrote:

On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a corrupt
record means you've reached end of WAL. In standby mode, when streaming WAL
from master, that shouldn't happen, and it's not clear what to do if it
does. PANIC is not a good idea, at least if the server uses hot standby,
because that only makes the situation worse from availability point of view.
So we log the error as a WARNING, and keep retrying. It's unlikely that the
problem will just go away, but we keep retrying anyway in the hope that it
does. However, it seems that we're too aggressive with the retries.

Right. The attached patch calms down the retries: if we found an invalid
record while streaming WAL from master, we sleep for 5 seconds (needs to
be reduced?) before retrying to replay the record which is in the same
location where the invalid one was found. Comments?

Hmm, right now it doesn't even reconnect when it sees a corrupt record
streamed from the master. It's really pointless to retry in that case,
reapplying the exact same piece of WAL surely won't work. I think it
should disconnect, and then retry reading from archive and pg_xlog, and
then retry streaming again. That's pretty hopeless too, but it's at
least theoretically possible that something went wrong in the
transmission and the file in the archive is fine.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

calm_down_retries_v2.patchtext/x-diff; name=calm_down_retries_v2.patchDownload+16-0
#11Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#9)
Re: warning message in standby

On Fri, Jun 11, 2010 at 8:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-06-10 at 09:57 -0400, Robert Haas wrote:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:

When an error is found in the WAL streamed from the master, a

warning

message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only

queries.

To fix this problem, we should add a sleep into

emode_for_corrupt_record()

or somewhere? Or we should stop walreceiver and retry to read WAL

from

pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

So you saw a bug, but didn't report it? Why was that?

Because it happened while I was in the middle of trying to investigate
a different bug, and I didn't know precisely what I had done to cause
it. I could have reported it anyway, but I thought it would not have
been that helpful to report a bug that I didn't know for sure how to
reproduce. Like everyone else here, I am working very hard and do not
always have time to follow up on every issue. I had something else
weird happen yesterday but I'm not sure I know exactly what I did and
I'm not at all sure that it wasn't just pilot error. If I start
reporting every pilot error as a bug, I'm going to get laughed out of
town, because I make a pilot error *every time* I tried to set up SR
w/HS. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: warning message in standby

On Fri, Jun 11, 2010 at 9:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, right now it doesn't even reconnect when it sees a corrupt record
streamed from the master. It's really pointless to retry in that case,
reapplying the exact same piece of WAL surely won't work. I think it should
disconnect, and then retry reading from archive and pg_xlog, and then retry
streaming again. That's pretty hopeless too, but it's at least theoretically
possible that something went wrong in the transmission and the file in the
archive is fine.

Yep, this makes more sense.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#4)
Re: warning message in standby

On Thu, 2010-06-10 at 19:01 +0300, Heikki Linnakangas wrote:

What "warning message" are we talking about? All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a
corrupt record means you've reached end of WAL. In standby mode, when
streaming WAL from master, that shouldn't happen, and it's not clear
what to do if it does. PANIC is not a good idea, at least if the server
uses hot standby, because that only makes the situation worse from
availability point of view. So we log the error as a WARNING, and keep
retrying. It's unlikely that the problem will just go away, but we keep
retrying anyway in the hope that it does. However, it seems that we're
too aggressive with the retries.

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services

#14Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#13)
Re: warning message in standby

On Fri, Jun 11, 2010 at 9:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-06-10 at 19:01 +0300, Heikki Linnakangas wrote:

What "warning message" are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a
corrupt record means you've reached end of WAL. In standby mode, when
streaming WAL from master, that shouldn't happen, and it's not clear
what to do if it does. PANIC is not a good idea, at least if the server
uses hot standby, because that only makes the situation worse from
availability point of view. So we log the error as a WARNING, and keep
retrying. It's unlikely that the problem will just go away, but we keep
retrying anyway in the hope that it does. However, it seems that we're
too aggressive with the retries.

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

+1. I was making this same argument (less eloquently) upthread.

I particularly like the errhint().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#15Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#14)
Re: warning message in standby

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

+1. I was making this same argument (less eloquently) upthread.
I particularly like the errhint().

I am wondering what action would be most likely to get the
administrator's attention.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#15)
Re: warning message in standby

On 12/06/10 04:19, Bruce Momjian wrote:

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

+1. I was making this same argument (less eloquently) upthread.
I particularly like the errhint().

I am wondering what action would be most likely to get the
administrator's attention.

I've committed the patch to disconnect the SR connection in that case.
If the message needs improvement, let's do that separately once we
figure out what to do.

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING: invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#16)
Re: warning message in standby

Heikki Linnakangas wrote:

On 12/06/10 04:19, Bruce Momjian wrote:

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

+1. I was making this same argument (less eloquently) upthread.
I particularly like the errhint().

I am wondering what action would be most likely to get the
administrator's attention.

I've committed the patch to disconnect the SR connection in that case.
If the message needs improvement, let's do that separately once we
figure out what to do.

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING: invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.

I am thinking about log monitoring tools like Nagios. I am afraid
they are never going to pick up something tagged WARNING, no matter
what the wording is. Crazy idea, but can we force a fatal error line
into the logs with something like "WARNING ...\nFATAL: ...".

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

#18Magnus Hagander
magnus@hagander.net
In reply to: Bruce Momjian (#17)
Re: warning message in standby

On Mon, Jun 14, 2010 at 12:16, Bruce Momjian <bruce@momjian.us> wrote:

Heikki Linnakangas wrote:

On 12/06/10 04:19, Bruce Momjian wrote:

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

+1.  I was making this same argument (less eloquently) upthread.
I particularly like the errhint().

I am wondering what action would be most likely to get the
administrator's attention.

I've committed the patch to disconnect the SR connection in that case.
If the message needs improvement, let's do that separately once we
figure out what to do.

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING:  invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.

I am thinking about log monitoring tools like Nagios.  I am afraid
they are never going to pick up something tagged WARNING, no matter

If they are properly configured, I imagine they would. And if they're
not, well, there's not much for us to do.

(What would be more usful then would be to separate "user-warnings"
like warnings about cast from actual system-warnings like this, but
that's a whole different story)

what the wording is.  Crazy idea, but can we force a fatal error line
into the logs with something like "WARNING ...\nFATAL: ...".

That's way too crazy :P And btw, randomly sticking newlines into that
will mess up *most* log displayers and I bet a lot of the log
monitoring tools as well...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#19Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#18)
Re: warning message in standby

Magnus Hagander wrote:

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING: ?invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.

I am thinking about log monitoring tools like Nagios. ?I am afraid
they are never going to pick up something tagged WARNING, no matter

If they are properly configured, I imagine they would. And if they're
not, well, there's not much for us to do.

What does that mean?

(What would be more usful then would be to separate "user-warnings"
like warnings about cast from actual system-warnings like this, but
that's a whole different story)

what the wording is. ?Crazy idea, but can we force a fatal error line
into the logs with something like "WARNING ...\nFATAL: ...".

That's way too crazy :P And btw, randomly sticking newlines into that
will mess up *most* log displayers and I bet a lot of the log
monitoring tools as well...

Yeah, it would mess up CSV log output, for example?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

#20Magnus Hagander
magnus@hagander.net
In reply to: Bruce Momjian (#19)
Re: warning message in standby

On Mon, Jun 14, 2010 at 13:11, Bruce Momjian <bruce@momjian.us> wrote:

Magnus Hagander wrote:

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING: ?invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.

I am thinking about log monitoring tools like Nagios. ?I am afraid
they are never going to pick up something tagged WARNING, no matter

If they are properly configured, I imagine they would. And if they're
not, well, there's not much for us to do.

What does that mean?

It means that we can't prevent people from configuring their tools to
ignore important warning. We can't prevent them rom ignoring ERROR or
FATAL either...

(What would be more usful then would be to separate "user-warnings"
like warnings about cast from actual system-warnings like this, but
that's a whole different story)

what the wording is. ?Crazy idea, but can we force a fatal error line
into the logs with something like "WARNING ...\nFATAL: ...".

That's way too crazy :P And btw, randomly sticking newlines into that
will mess up *most* log displayers and I bet a lot of the log
monitoring tools as well...

Yeah, it would mess up CSV log output, for example?

Possibly. And your general syslog output would become pretty hard to
read once this kind of stuff shows up.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#21Bruce Momjian
bruce@momjian.us
In reply to: Magnus Hagander (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Magnus Hagander (#20)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#17)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#28Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#29)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#26)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#30)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#31)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#35)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#36)
#39Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#35)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#34)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#44)
#46Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#44)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#46)
#48Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#32)
#49Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#45)
#50Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#47)
#51Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#33)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#51)
#53Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#51)
#54Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#53)
#55Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#54)
#56Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#55)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#56)
#58Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#57)
#59Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#54)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#59)
#61Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#58)