corruption of WAL page header is never reported

Started by Yugo Nagataalmost 5 years ago29 messageshackers
Jump to latest
#1Yugo Nagata
nagata@sraoss.co.jp

Hello,

I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error messages
are always reset and never reported.

if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{
/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
}

Since the commit 06687198018, we call XLogReaderValidatePageHeader here so that
we can check a page header and retry immediately if it's invalid, but the error
message is reset immediately and not reported. I guess the reason why the error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order to let
users know the actual issues founded in the WAL.

I attached a patch to fix it in this way.

Or, if we wouldn't like to report an error for each check and also what we want
to check here is just about old recycled WAL instead of header corruption itself,
I wander that we could check just xlp_pageaddr instead of calling
XLogReaderValidatePageHeader.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

Attachments:

report_corrupt_page_header.patchtext/x-diff; name=report_corrupt_page_header.patchDownload+7-2
#2Ranier Vilela
ranier.vf@gmail.com
In reply to: Yugo Nagata (#1)
Re: corruption of WAL page header is never reported

Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp>
escreveu:

Hello,

I found that any corruption of WAL page header found during recovery is
never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error
messages
are always reset and never reported.

if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
readBuf))
{
/* reset any error XLogReaderValidatePageHeader() might
have set */
xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
}

Since the commit 06687198018, we call XLogReaderValidatePageHeader here so
that
we can check a page header and retry immediately if it's invalid, but the
error
message is reset immediately and not reported. I guess the reason why the
error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order
to let
users know the actual issues founded in the WAL.

I attached a patch to fix it in this way.

I think to keep the same behavior as before, is necessary always run:

/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';

not?

regards,
Ranier Vilela

Attachments:

v1_report_corrupt_page_header.patchtext/x-patch; charset=US-ASCII; name=v1_report_corrupt_page_header.patchDownload+5-0
#3Yugo Nagata
nagata@sraoss.co.jp
In reply to: Ranier Vilela (#2)
Re: corruption of WAL page header is never reported

On Sat, 17 Jul 2021 18:40:02 -0300
Ranier Vilela <ranier.vf@gmail.com> wrote:

Em sáb., 17 de jul. de 2021 às 16:57, Yugo NAGATA <nagata@sraoss.co.jp>
escreveu:

Hello,

I found that any corruption of WAL page header found during recovery is
never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error
messages
are always reset and never reported.

if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
readBuf))
{
/* reset any error XLogReaderValidatePageHeader() might
have set */
xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
}

Since the commit 06687198018, we call XLogReaderValidatePageHeader here so
that
we can check a page header and retry immediately if it's invalid, but the
error
message is reset immediately and not reported. I guess the reason why the
error
message is reset is because we might get the right WAL after some retries.
However, I think it is better to report the error for each check in order
to let
users know the actual issues founded in the WAL.

I attached a patch to fix it in this way.

I think to keep the same behavior as before, is necessary always run:

/* reset any error XLogReaderValidatePageHeader() might have set */
xlogreader->errormsg_buf[0] = '\0';

not?

If we are not in StandbyMode, the check is not retried, and an error is returned
immediately. So, I think ,we don't have to display an error message in such cases,
and neither reset it. Instead, it would be better to leave the error message
handling to the caller of XLogReadRecord.

Regards,
Yugo Nagat

--
Yugo NAGATA <nagata@sraoss.co.jp>

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yugo Nagata (#1)
Re: corruption of WAL page header is never reported

Hello.

At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in

Hello,

I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error messages
are always reset and never reported.

Good catch! Currently recovery stops showing no reason if it is
stopped by page-header errors.

I attached a patch to fix it in this way.

However, it is a kind of a roof-over-a-roof. What we should do is
just omitting the check in XLogPageRead while in standby mode.

Or, if we wouldn't like to report an error for each check and also what we want
to check here is just about old recycled WAL instead of header corruption itself,
I wander that we could check just xlp_pageaddr instead of calling
XLogReaderValidatePageHeader.

I'm not sure. But as described in the commit message, the commit
intended to save a common case and there's no obvious reason to (and
not to) restrict the check only to page address. So it uses the
established checking function.

I was tempted to adjust the comment just above by adding "while in
standby mode", but "so that we can retry immediately" is suggesting
that so I didn't do that in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Don-t-forget-message-of-hage-header-errors-while-.patchtext/x-patch; charset=us-asciiDownload+2-2
#5Yugo Nagata
nagata@sraoss.co.jp
In reply to: Kyotaro Horiguchi (#4)
Re: corruption of WAL page header is never reported

On Mon, 19 Jul 2021 15:14:41 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Hello.

At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in

Hello,

I found that any corruption of WAL page header found during recovery is never
reported in log messages. If wal page header is broken, it is detected in
XLogReaderValidatePageHeader called from XLogPageRead, but the error messages
are always reset and never reported.

Good catch! Currently recovery stops showing no reason if it is
stopped by page-header errors.

I attached a patch to fix it in this way.

However, it is a kind of a roof-over-a-roof. What we should do is
just omitting the check in XLogPageRead while in standby mode.

Your patch doesn't fix the issue that the error message is never reported in
standby mode. When a WAL page header is broken, the standby would silently repeat
retrying forever.

I think we have to let users know the corruption of WAL page header even in
standby mode, not? A corruption of WAL record header is always reported,
by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Yugo Nagata (#5)
Re: corruption of WAL page header is never reported

At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in

Your patch doesn't fix the issue that the error message is never reported in
standby mode. When a WAL page header is broken, the standby would silently repeat
retrying forever.

Ok, I see your point and agree to that.

I think we have to let users know the corruption of WAL page header even in
standby mode, not? A corruption of WAL record header is always reported,
by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Howeer, I'm still on the opinion that we don't need to check that
while in standby mode.

How about the attached?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Don-t-hide-xlog-page-header-errors-while-recovery.patchtext/x-patch; charset=us-asciiDownload+10-2
#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: corruption of WAL page header is never reported

me> Howeer, I'm still on the opinion that we don't need to check that
me> while in standby mode.

Of course it's typo of "while not in standby mode".

sorry..

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Yugo Nagata
nagata@sraoss.co.jp
In reply to: Kyotaro Horiguchi (#6)
Re: corruption of WAL page header is never reported

On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in

Your patch doesn't fix the issue that the error message is never reported in
standby mode. When a WAL page header is broken, the standby would silently repeat
retrying forever.

Ok, I see your point and agree to that.

I think we have to let users know the corruption of WAL page header even in
standby mode, not? A corruption of WAL record header is always reported,
by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Howeer, I'm still on the opinion that we don't need to check that
while in standby mode.

How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

me> Howeer, I'm still on the opinion that we don't need to check that
me> while in standby mode.

Of course it's typo of "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#9Ranier Vilela
ranier.vf@gmail.com
In reply to: Yugo Nagata (#8)
Re: corruption of WAL page header is never reported

Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp>
escreveu:

On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp>

wrote in

Your patch doesn't fix the issue that the error message is never

reported in

standby mode. When a WAL page header is broken, the standby would

silently repeat

retrying forever.

Ok, I see your point and agree to that.

I think we have to let users know the corruption of WAL page header

even in

standby mode, not? A corruption of WAL record header is always

reported,

by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Howeer, I'm still on the opinion that we don't need to check that
while in standby mode.

How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

me> Howeer, I'm still on the opinion that we don't need to check that
me> while in standby mode.

Of course it's typo of "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.

And as I had reviewed, your first patch was wrong and now with the Kyotaro
version,
to keep the same behavior, it is necessary to reset the error, correct?

regards,
Ranier Vilela

#10Yugo Nagata
nagata@sraoss.co.jp
In reply to: Ranier Vilela (#9)
Re: corruption of WAL page header is never reported

On Mon, 19 Jul 2021 06:32:28 -0300
Ranier Vilela <ranier.vf@gmail.com> wrote:

Em seg., 19 de jul. de 2021 às 06:15, Yugo NAGATA <nagata@sraoss.co.jp>
escreveu:

On Mon, 19 Jul 2021 17:47:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 19 Jul 2021 16:00:39 +0900, Yugo NAGATA <nagata@sraoss.co.jp>

wrote in

Your patch doesn't fix the issue that the error message is never

reported in

standby mode. When a WAL page header is broken, the standby would

silently repeat

retrying forever.

Ok, I see your point and agree to that.

I think we have to let users know the corruption of WAL page header

even in

standby mode, not? A corruption of WAL record header is always

reported,

by the way. (See that XLogReadRecord is calling ValidXLogRecordHeader.)

Howeer, I'm still on the opinion that we don't need to check that
while in standby mode.

How about the attached?

On Mon, 19 Jul 2021 17:50:16 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

me> Howeer, I'm still on the opinion that we don't need to check that
me> while in standby mode.

Of course it's typo of "while not in standby mode".

Thanks for updating the patch. I agree with you.

I think it is nice to fix to perform the check only during standby mode
because it make a bit clearer why we check it immediately in XLogPageRead.

And as I had reviewed, your first patch was wrong and now with the Kyotaro
version,
to keep the same behavior, it is necessary to reset the error, correct?

Well, I think my first patch was not wrong. The difference with the latest
patch is just whether we perform the additional check when we are not in
standby mode or not. The behavior is basically the same although which function
detects and prints the page-header error in cases of crash recovery is different.
On the other hand, in your patch, the error message was always ommitted in cases
of crash recovery, and it seemed to me wrong.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata@sraoss.co.jp>

#11Fujii Masao
masao.fujii@gmail.com
In reply to: Yugo Nagata (#10)
Re: corruption of WAL page header is never reported

On 2021/07/19 18:52, Yugo NAGATA wrote:

Well, I think my first patch was not wrong. The difference with the latest
patch is just whether we perform the additional check when we are not in
standby mode or not. The behavior is basically the same although which function
detects and prints the page-header error in cases of crash recovery is different.

Yes, so which patch do you think is better? I like your version
because there seems no reason why XLogPageRead() should skip
XLogReaderValidatePageHeader() when not in standby mode.

Also I'm tempted to move ereport() and reset of errmsg_buf to
under next_record_is_invalid as follows. That is, in standby mode
whenever we find an invalid record and retry reading WAL page
in XLogPageRead(), we report the error message and reset it.
For now in XLogPageRead(), only XLogReaderValidatePageHeader()
sets errmsg_buf, but in the future other code or function doing that
may be added. For that case, the following change seems more elegant.
Thought?

* shouldn't be a big deal from a performance point of view.
*/
if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
- {
- /* reset any error XLogReaderValidatePageHeader() might have set */
- xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
- }

return readLen;

@@ -12517,7 +12513,17 @@ next_record_is_invalid:

  	/* In standby-mode, keep trying */
  	if (StandbyMode)
+	{
+		if (xlogreader->errormsg_buf[0])
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+					(errmsg_internal("%s", xlogreader->errormsg_buf)));
+
+			/* reset any error XLogReaderValidatePageHeader() might have set */
+			xlogreader->errormsg_buf[0] = '\0';
+		}
  		goto retry;
+	}
  	else
  		return -1;
  }

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#11)
Re: corruption of WAL page header is never reported

At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/07/19 18:52, Yugo NAGATA wrote:

Well, I think my first patch was not wrong. The difference with the
latest
patch is just whether we perform the additional check when we are not
in
standby mode or not. The behavior is basically the same although which
function
detects and prints the page-header error in cases of crash recovery is
different.

Yes, so which patch do you think is better? I like your version
because there seems no reason why XLogPageRead() should skip
XLogReaderValidatePageHeader() when not in standby mode.

Did you read the comment just above?

xlog.c:12523

* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
* validates the page header anyway, and would propagate the failure up to
* ReadRecord(), which would retry. However, there's a corner case with
* continuation records, if a record is split across two pages such that

So when not in standby mode, the same check is performed by xlogreader
which has the responsibility to validate the binary data read by
XLogPageRead. The page-header validation is a compromise to save a
specific case.

Also I'm tempted to move ereport() and reset of errmsg_buf to
under next_record_is_invalid as follows. That is, in standby mode
whenever we find an invalid record and retry reading WAL page
in XLogPageRead(), we report the error message and reset it.
For now in XLogPageRead(), only XLogReaderValidatePageHeader()
sets errmsg_buf, but in the future other code or function doing that
may be added. For that case, the following change seems more elegant.
Thought?

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

* shouldn't be a big deal from a performance point of view.
*/
if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
-	{
-		/* reset any error XLogReaderValidatePageHeader() might have set */
-		xlogreader->errormsg_buf[0] = '\0';
goto next_record_is_invalid;
-	}
return readLen;
@@ -12517,7 +12513,17 @@ next_record_is_invalid:
/* In standby-mode, keep trying */
if (StandbyMode)
+	{
+		if (xlogreader->errormsg_buf[0])
+		{
+			ereport(emode_for_corrupt_record(emode, EndRecPtr),
+ (errmsg_internal("%s", xlogreader->errormsg_buf)));
+
+ /* reset any error XLogReaderValidatePageHeader() might have set */
+			xlogreader->errormsg_buf[0] = '\0';
+		}
goto retry;
+	}
else
return -1;
}

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: corruption of WAL page header is never reported

Sorry, please let me add something.

At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

Also I'm tempted to move ereport() and reset of errmsg_buf to
under next_record_is_invalid as follows. That is, in standby mode
whenever we find an invalid record and retry reading WAL page
in XLogPageRead(), we report the error message and reset it.
For now in XLogPageRead(), only XLogReaderValidatePageHeader()
sets errmsg_buf, but in the future other code or function doing that
may be added. For that case, the following change seems more elegant.
Thought?

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

In other words, XLogReaderValidatePageHeader is foreign for
XLogPageRead and the function indeuces the need of extra care for
errormsg_buf that is not relevant to the elog-capable module.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#13)
Re: corruption of WAL page header is never reported

(.. sorry for the chained-mails)

At Thu, 02 Sep 2021 13:31:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Sorry, please let me add something.

At Thu, 02 Sep 2021 13:17:16 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 2 Sep 2021 12:19:25 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

Also I'm tempted to move ereport() and reset of errmsg_buf to
under next_record_is_invalid as follows. That is, in standby mode
whenever we find an invalid record and retry reading WAL page
in XLogPageRead(), we report the error message and reset it.
For now in XLogPageRead(), only XLogReaderValidatePageHeader()
sets errmsg_buf, but in the future other code or function doing that
may be added. For that case, the following change seems more elegant.
Thought?

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

In other words, XLogReaderValidatePageHeader is foreign for
XLogPageRead and the function indeuces the need of extra care for
errormsg_buf that is not relevant to the elog-capable module.

However, I can agree that the error handling code can be moved further
later. Like this,

* shouldn't be a big deal from a performance point of view.
*/

-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
-		/* reset any error XLogReaderValidatePageHeader() might have set */
-		xlogreader->errormsg_buf[0] = '\0';
-		goto next_record_is_invalid;
+   if (... && XLogReaderValidatePageHeader())
+       goto page_header_is_invalid;
...

return readlen;

+ page_header_is_invalid:
+	/*
+	 * in this case we consume this error right now then retry immediately,
+	 * the message is already translated
+	 */
+	if (xlogreader->errormsg_buf[0])
+		ereport(emode_for_corrupt_record(emode, EndRecPtr),
+				(errmsg_internal("%s", xlogreader->errormsg_buf)));
+
+ 	/* reset any error XLogReaderValidatePageHeader() might have set */
+ 	xlogreader->errormsg_buf[0] = '\0';

next_record_is_invalid:
lastSourceFailed = true;

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: corruption of WAL page header is never reported

On 2021/09/02 13:17, Kyotaro Horiguchi wrote:

Did you read the comment just above?

Yes!

xlog.c:12523

* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
* validates the page header anyway, and would propagate the failure up to
* ReadRecord(), which would retry. However, there's a corner case with
* continuation records, if a record is split across two pages such that

So when not in standby mode, the same check is performed by xlogreader
which has the responsibility to validate the binary data read by
XLogPageRead. The page-header validation is a compromise to save a
specific case.

Yes, so XLogPageRead() can skip the validation check of page head if not
in standby mode. On the other hand, there is no problem if it still performs
the validation check as it does for now. No?

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

I don't think that the proposed change does that. But maybe I failed to get
your point yet... Anyway the proposed change just tries to reset
errormsg_buf whenever XLogPageRead() retries, whatever error happened
before. Also if errormsg_buf is set at that moment, it's reported.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#15)
Re: corruption of WAL page header is never reported

At Thu, 2 Sep 2021 14:44:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/09/02 13:17, Kyotaro Horiguchi wrote:

Did you read the comment just above?

Yes!

Glad to hear that, or..:p

xlog.c:12523

* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
* validates the page header anyway, and would propagate the failure up to
* ReadRecord(), which would retry. However, there's a corner case with
* continuation records, if a record is split across two pages such that

So when not in standby mode, the same check is performed by xlogreader
which has the responsibility to validate the binary data read by
XLogPageRead. The page-header validation is a compromise to save a
specific case.

Yes, so XLogPageRead() can skip the validation check of page head if
not
in standby mode. On the other hand, there is no problem if it still
performs
the validation check as it does for now. No?

Practically yes, and it has always been like that as you say.

I don't think it is good choice to conflate read-failure and header
validation failure from the view of modularity.

I don't think that the proposed change does that. But maybe I failed

It's about your idea in a recent mail. not about the proposed
patch(es).

to get
your point yet... Anyway the proposed change just tries to reset
errormsg_buf whenever XLogPageRead() retries, whatever error happened
before. Also if errormsg_buf is set at that moment, it's reported.

I believe errmsg_buf is an interface to emit error messages dedicated
to xlogreader that doesn't have an access to elog facility, and
xlogreader doesn't (or ought not to or expect to) suppose
non-xlogreader callback functions set the variable. In that sense I
don't think theoriginally proposed patch is proper for the reason that
the non-xlogreader callback function may set errmsg_buf. This is what
I meant by the word "modularity".

For that reason I avoided in my second proposal to call
XLogReaderValidatePageHeader() at all while not in standby mode,
because calling the validator function while in non-standby mode
results in the non-xlogreader function return errmsg_buf. Of course
we can instead always consume errmsg_buf in the function but I don't
like to shadow the caller's task.

Does that makes sense?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: corruption of WAL page header is never reported

On 2021/09/02 16:26, Kyotaro Horiguchi wrote:

I believe errmsg_buf is an interface to emit error messages dedicated
to xlogreader that doesn't have an access to elog facility, and
xlogreader doesn't (or ought not to or expect to) suppose
non-xlogreader callback functions set the variable. In that sense I
don't think theoriginally proposed patch is proper for the reason that
the non-xlogreader callback function may set errmsg_buf. This is what
I meant by the word "modularity".

For that reason I avoided in my second proposal to call
XLogReaderValidatePageHeader() at all while not in standby mode,
because calling the validator function while in non-standby mode
results in the non-xlogreader function return errmsg_buf. Of course
we can instead always consume errmsg_buf in the function but I don't
like to shadow the caller's task.

Understood. Thanks for clarifying this!

Does that makes sense?

Yes, I'm fine with your latest patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#17)
Re: corruption of WAL page header is never reported

At Thu, 2 Sep 2021 21:52:00 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/09/02 16:26, Kyotaro Horiguchi wrote:

I believe errmsg_buf is an interface to emit error messages dedicated
to xlogreader that doesn't have an access to elog facility, and
xlogreader doesn't (or ought not to or expect to) suppose
non-xlogreader callback functions set the variable. In that sense I
don't think theoriginally proposed patch is proper for the reason that
the non-xlogreader callback function may set errmsg_buf. This is what
I meant by the word "modularity".
For that reason I avoided in my second proposal to call
XLogReaderValidatePageHeader() at all while not in standby mode,
because calling the validator function while in non-standby mode
results in the non-xlogreader function return errmsg_buf. Of course
we can instead always consume errmsg_buf in the function but I don't
like to shadow the caller's task.

Understood. Thanks for clarifying this!

Does that makes sense?

Yes, I'm fine with your latest patch.

Thanks. Maybe some additional comment is needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: corruption of WAL page header is never reported

At Fri, 03 Sep 2021 16:55:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Yes, I'm fine with your latest patch.

Thanks. Maybe some additional comment is needed.

Something like this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-Don-t-hide-xlog-page-header-errors-while-recovery.difftext/x-patch; charset=us-asciiDownload+13-1
#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#19)
Re: corruption of WAL page header is never reported

On 2021-Sep-03, Kyotaro Horiguchi wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..b621ad6b0f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12496,9 +12496,21 @@ retry:
*
* Validating the page header is cheap enough that doing it twice
* shouldn't be a big deal from a performance point of view.
+	 *
+	 * Don't call XLogReaderValidatePageHeader here while not in standby mode
+	 * so that this function won't return with a valid errmsg_buf.
*/
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (StandbyMode &&
+		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))

OK, but I don't understand why we have a comment that says (referring to
non-standby mode) "doing it twice shouldn't be a big deal", followed by
"Don't do it twice while not in standby mode" -- that seems quite
contradictory. I think the new comment should overwrite the previous
one, something like this:

-	 * Validating the page header is cheap enough that doing it twice
-	 * shouldn't be a big deal from a performance point of view.
+	 *
+	 * We do this in standby mode only,
+	 * so that this function won't return with a valid errmsg_buf.
 	 */
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (StandbyMode &&
+		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

#21Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#20)
#22Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#21)
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#22)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#23)
#25Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#24)
#26Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#25)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#25)
#28Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#27)
#29Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#28)