Make mesage at end-of-recovery less scary.

Started by Kyotaro Horiguchi, almost 6 years ago, 69 messages

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 6 years ago

1 attachment(s)

Hello, this is a followup thread of [1].

# I didn't notice that the thread didn't cover -hackers..

When recovery of any type ends, we see several kinds of error messages
that say "WAL is broken".

LOG: invalid record length at 0/7CB6BC8: wanted 24, got 0
LOG: redo is not required
LOG: database system is ready to accept connections

This patch reduces the scariness of such messages as follows.

LOG: rached end of WAL at 0/1551048 on timeline 1 in pg_wal during crash recovery
DETAIL: invalid record length at 0/1551048: wanted 24, got 0
LOG: redo is not required
LOG: database system is ready to accept connections

[1]: /messages/by-id/20200117.172655.1384889922565817808.horikyota.ntt@gmail.com

I'll register this to the coming CF.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From f3692cb484b7f1ebc351ba8a522039c0b91bcfdb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c | 45 ++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..452c376f62 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4288,6 +4288,10 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			int actual_emode =
+				emode_for_corrupt_record(emode,
+										 ReadRecPtr ? ReadRecPtr : EndRecPtr);
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -4295,14 +4299,41 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * randAccess here means we are reading successive records during
+			 * recovery. If we get here during recovery, we can assume that we
+			 * reached the end of WAL.  Otherwise something's really wrong and
+			 * we report just only the errormsg if any. If we don't receive
+			 * errormsg here, we already logged something.  We don't emit
+			 * "reached end of WAL" in muted messages.
+			 *
+			 * Note: errormsg is alreay translated.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!private->randAccess && actual_emode == emode)
+			{
+				if (StandbyMode)
+					ereport(actual_emode,
+							(errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",
+									 (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+									 ThisTimeLineID,
+									 xlogSourceNames[currentSource]),
+							 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+				else if (InArchiveRecovery)
+					ereport(actual_emode,
+							(errmsg ("rached end of WAL at %X/%X on timeline %u in %s during archive recovery",
+									 (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+									 ThisTimeLineID,
+									 xlogSourceNames[currentSource]),
+							 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+				else
+					ereport(actual_emode,
+							(errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
+									 (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+									 ThisTimeLineID,
+									 xlogSourceNames[currentSource]),
+							 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
+			else if (errormsg)
+				ereport(actual_emode, (errmsg_internal("%s", errormsg)));
 		}
 
 		/*
-- 
2.18.2
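
The messages in this patch print a WAL position as two hexadecimal halves (`%X/%X`), passing the upper and lower 32 bits of the 64-bit position separately. The following is a minimal standalone sketch of that formatting only, not the PostgreSQL code itself; `WalPosition` and `format_wal_position` are hypothetical names standing in for `XLogRecPtr` and the inline `ereport` arguments.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's 64-bit XLogRecPtr. */
typedef uint64_t WalPosition;

/*
 * Write a WAL position in the "high/low" hexadecimal form seen in the
 * log lines above, e.g. 0/1551048.  Returns snprintf's result, i.e. the
 * number of characters the full string needs.
 */
static int
format_wal_position(char *buf, size_t buflen, WalPosition pos)
{
	assert(buf != NULL && buflen > 0);

	/* Upper 32 bits before the slash, lower 32 bits after it. */
	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					(uint32_t) (pos >> 32), (uint32_t) pos);
}
```

Calling it with the position from the sample log, `0x1551048`, yields the string `0/1551048`.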

Michael Paquier

michael@paquier.xyz

almost 6 years ago

In reply to: Kyotaro Horiguchi (#1)

Re: Make mesage at end-of-recovery less scary.

On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote:

Hello, this is a followup thread of [1].

# I didn't notice that the thread didn't cover -hackers..

When recovery of any type ends, we see several kinds of error messages
that say "WAL is broken".

Have you considered an error context here? Your patch leads to a bit
of duplication with the message a bit down of what you are changing
where the end of local pg_wal is reached.

+	* reached the end of WAL.  Otherwise something's really wrong and
+	* we report just only the errormsg if any. If we don't receive

This sentence sounds strange to me. Or you meant "Something is wrong,
so use errormsg as report if it is set"?

+ * Note: errormsg is alreay translated.

Typo here.

+	if (StandbyMode)
+		ereport(actual_emode,
+			(errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",

StandbyMode happens also with only WAL archiving, depending on if
primary_conninfo is set or not.

+ (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",

FWIW, you are introducing three times the same typo, in the same
word, in three different messages.
--
Michael

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 6 years ago

In reply to: Michael Paquier (#2)

Re: Make mesage at end-of-recovery less scary.

Thank you for the comments.

At Fri, 28 Feb 2020 16:33:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote:

Hello, this is a followup thread of [1].

# I didn't notice that the thread didn't cover -hackers..

When recovery of any type ends, we see several kinds of error messages
that say "WAL is broken".

Have you considered an error context here? Your patch leads to a bit
of duplication with the message a bit down of what you are changing
where the end of local pg_wal is reached.

It is a DEBUG message, and it is emitted when moving from crash
recovery to archive recovery. I could remove it but decided to leave
it for traceability.

+	* reached the end of WAL.  Otherwise something's really wrong and
+	* we report just only the errormsg if any. If we don't receive
This sentence sounds strange to me. Or you meant "Something is wrong,
so use errormsg as report if it is set"?

The whole comment there follows.
| recovery. If we get here during recovery, we can assume that we
| reached the end of WAL. Otherwise something's really wrong and
| we report just only the errormsg if any. If we don't receive
| errormsg here, we already logged something. We don't emit
| "reached end of WAL" in muted messages.

"Otherwise" means "other than the case of recovery". "Just only the
errormsg" means "show the message on its own, not as a part of the
"reached end of WAL" message".

+ * Note: errormsg is alreay translated.

Typo here.

Thanks. Will fix along with "rached".

+	if (StandbyMode)
+		ereport(actual_emode,
+			(errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",
StandbyMode happens also with only WAL archiving, depending on if
primary_conninfo is set or not.

Right. I'll fix it. Maybe to "during standby mode".

+ (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",

FWIW, you are introducing three times the same typo, in the same
word, in three different messages.

They're copy-paste errors. I refrained from constructing an error message
from multiple non-independent parts. Are you suggesting to do so?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
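
The trade-off mentioned above, three near-identical full sentences versus one message assembled from fragments, can be sketched as follows. This is an illustration under assumptions, not the patch: `RecoveryPhase` and `end_of_wal_format` are hypothetical names, the strings are plain literals rather than translatable ones, and the "standby mode" wording is the one proposed in this reply. Keeping each variant a complete sentence lets a translator see it whole, at the cost of some repetition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration; the patch tests StandbyMode and InArchiveRecovery. */
typedef enum RecoveryPhase
{
	PHASE_CRASH_RECOVERY,
	PHASE_ARCHIVE_RECOVERY,
	PHASE_STANDBY
} RecoveryPhase;

/*
 * Return one complete format string per phase instead of gluing
 * "during ..." onto a shared prefix at run time.
 */
static const char *
end_of_wal_format(RecoveryPhase phase)
{
	switch (phase)
	{
		case PHASE_STANDBY:
			return "reached end of WAL at %X/%X on timeline %u in %s during standby mode";
		case PHASE_ARCHIVE_RECOVERY:
			return "reached end of WAL at %X/%X on timeline %u in %s during archive recovery";
		case PHASE_CRASH_RECOVERY:
			return "reached end of WAL at %X/%X on timeline %u in %s during crash recovery";
	}

	/* Not reached for valid enum values. */
	assert(0);
	return NULL;
}
```

All three strings take the same arguments in the same order, so a single reporting call can use whichever one is selected.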

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 6 years ago

In reply to: Kyotaro Horiguchi (#3)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Hello.

I changed the condition from randAccess to fetching_ckpt considering
the discussion in another thread [1]. Then I moved the block that
shows the new messages to a more appropriate place.

At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Have you considered an error context here? Your patch leads to a bit
of duplication with the message a bit down of what you are changing
where the end of local pg_wal is reached.

It is a DEBUG message, and it is emitted when moving from crash
recovery to archive recovery. I could remove it but decided to leave
it for traceability.

I modified the message so that it looks the same as the new
messages, but I left it at DEBUG1, since it is just an intermediate
state. We should finally see one of the three new messages.

After the messages changed, another message from the walreceiver came to
look redundant.

| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
| [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0

+	* reached the end of WAL.  Otherwise something's really wrong and
+	* we report just only the errormsg if any. If we don't receive
This sentence sounds strange to me. Or you meant "Something is wrong,
so use errormsg as report if it is set"?

The message no longer exists.

+ (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",

FWIW, you are introducing three times the same typo, in the same
word, in three different messages.

They're copy-paste errors. I refrained from constructing an error message
from multiple non-independent parts. Are you suggesting to do so?

Repeating almost the same phrase three times is hard to read. I
rewrote it in a simpler form.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v2] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c     | 72 ++++++++++++++++++++-------
 src/backend/replication/walreceiver.c |  3 +-
 2 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..849cf6fe6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we are fetching checkpoint, we emit the error message right
+			 * now. Otherwise the error is regarded as "end of WAL" and the
+			 * message if any is shown as a part of the end-of-WAL message
+			 * below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (fetching_ckpt && errormsg)
+			{
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			}
 		}
 
 		/*
@@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 (uint32) (ErrRecPtr >> 32),
+										 (uint32) ErrRecPtr,
+										 ThisTimeLineID,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  We reached the end of WAL, show the messages just once at the
+			 *  same LSN.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg (fmt, (uint32) (EndRecPtr >> 32),
+								 (uint32) EndRecPtr,
+								 ThisTimeLineID,
+								 xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..682dbb4e1f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -478,8 +478,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
 							endofwal = true;
-- 
2.18.2
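
The v2 patch relies on `emode_for_corrupt_record()` to show the end-of-WAL message "just once at the same LSN". A rough standalone sketch of that once-per-position idea follows, assuming a simplified rule: the first report at a position keeps its level, and a repeat at the same position is demoted to a debug level. The names `level_for_end_of_wal`, `LEVEL_LOG` and `LEVEL_DEBUG1` are hypothetical, and any further conditions the real function applies are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical log levels for this sketch only. */
enum
{
	LEVEL_DEBUG1 = 1,
	LEVEL_LOG = 2
};

/* Hypothetical stand-in for PostgreSQL's 64-bit XLogRecPtr. */
typedef uint64_t WalPosition;

/*
 * Return the level to report at.  A LOG-level report is kept the first
 * time a position is seen and demoted to DEBUG1 when the same position
 * is reported again, so a retry loop does not repeat the message.
 */
static int
level_for_end_of_wal(int level, WalPosition pos)
{
	/* Position of the most recent LOG-level report; 0 means none yet. */
	static WalPosition last_reported = 0;

	if (level == LEVEL_LOG)
	{
		if (pos == last_reported)
			return LEVEL_DEBUG1;
		last_reported = pos;
	}
	return level;
}
```

With this shape, a caller can compare the returned level with `LEVEL_LOG` to decide whether to emit the message at all, which is how the v2 patch uses the real function.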

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 6 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: Make mesage at end-of-recovery less scary.

On 2020-03-05 08:06, Kyotaro Horiguchi wrote:

| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
| [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0

Is this the before and after? That doesn't seem like a substantial
improvement to me. You still get the "scary" message at the end.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Ashwin Agrawal

aagrawal@pivotal.io

almost 6 years ago

In reply to: Peter Eisentraut (#5)

Re: Make mesage at end-of-recovery less scary.

On Mon, Mar 23, 2020 at 2:37 AM Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

On 2020-03-05 08:06, Kyotaro Horiguchi wrote:

| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
| [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0

Is this the before and after? That doesn't seem like a substantial
improvement to me. You still get the "scary" message at the end.

+1, I agree it still reads as scary and doesn't seem like an improvement.

Plus, I am hoping the message will improve for pg_waldump as well?
It reads as confusing, and every time I have to explain to a new developer
that it's expected behavior, which is annoying.

pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
at 0/1553FA8: wanted 24, got 0

Andres Freund

andres@anarazel.de

almost 6 years ago

In reply to: Peter Eisentraut (#5)

Re: Make mesage at end-of-recovery less scary.

Hi,

On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:

On 2020-03-05 08:06, Kyotaro Horiguchi wrote:

| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress

IMO it's a bug that we see this FATAL. I seem to recall that we didn't
use to get that?

| [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
| [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0

Is this the before and after? That doesn't seem like a substantial
improvement to me. You still get the "scary" message at the end.

It seems like a minor improvement - folding the DETAIL into the message
makes sense to me here. But it indeed doesn't really address the main
issue.

I think we don't want to elide the information about how the end of the
WAL was detected - there are some issues where I found that quite
helpful. But we could reformulate it to be clearer that it's informative
output, not a bug. E.g. something roughly like

LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0
(I first elided the position in the DETAIL, but it could differ from the
one in LOG)

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

Greetings,

Andres Freund

andres@anarazel.de

almost 6 years ago

In reply to: Ashwin Agrawal (#6)

Re: Make mesage at end-of-recovery less scary.

Hi,

On 2020-03-23 10:43:09 -0700, Ashwin Agrawal wrote:

Plus, I am hoping the message will improve for pg_waldump as well?
It reads as confusing, and every time I have to explain to a new developer
that it's expected behavior, which is annoying.

pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
at 0/1553FA8: wanted 24, got 0

What would you like to see here? There's inherently a lot less
information about the context in waldump. We can't know whether it's to
be expected that the WAL ends at that point, or whether there was
corruption.

Greetings,

Andres Freund

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 6 years ago

In reply to: Andres Freund (#7)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Mon, 23 Mar 2020 12:47:36 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:

On 2020-03-05 08:06, Kyotaro Horiguchi wrote:

| [20866] LOG: replication terminated by primary server
| [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress

IMO it's a bug that we see this FATAL. I seem to recall that we didn't
use to get that?

I thought that it is a convention that an auxiliary process uses ERROR
(which is turned into FATAL in ereport) to exit, which I didn't like
so much, but it was out of scope of this patch.

As for the message above, the FATAL is preceded by the "LOG:
replication terminated by" message, which means walreceiver tries to
send new data after disconnection only to fail, which is
unreasonable. I think we should exit immediately after detecting
disconnection. The FATAL is gone with the attached patch.

| [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0

Is this the before and after? That doesn't seem like a substantial
improvement to me. You still get the "scary" message at the end.

It seems like a minor improvement - folding the DETAIL into the message
makes sense to me here. But it indeed doesn't really address the main
issue.

I think we don't want to elide the information about how the end of the
WAL was detected - there are some issues where I found that quite
helpful. But we could reformulate it to be clearer that it's informative
output, not a bug. E.g. something roughly like

LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0
(I first elided the position in the DETAIL, but it could differ from the
one in LOG)

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

The 0-length record is not an "invalid" state during recovery, so we
can add a message for that state, "record length is 0 at %X/%X". I
think that if any other state is found there, it implies something is wrong.

The LSN is shown redundantly, but I'm not sure whether it would be better
to remove it from one of the two lines.

| LOG: reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery
| DETAIL: record length is 0 at 0/3000850

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v3-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 47511afed5f8acf92abaf1cd6fcfecc1faea9c87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v3] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c       | 69 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  9 ++++
 src/backend/replication/walreceiver.c   | 11 ++--
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 793c076da6..6c2924dfb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4283,12 +4283,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -4296,13 +4299,13 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we are fetching checkpoint, we emit the error message right
+			 * now. Otherwise the error is regarded as "end of WAL" and the
+			 * message if any is shown as a part of the end-of-WAL message
+			 * below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (fetching_ckpt && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4333,11 +4336,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4350,11 +4354,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 (uint32) (ErrRecPtr >> 32),
+										 (uint32) ErrRecPtr,
+										 ThisTimeLineID,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4392,12 +4403,34 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  We reached the end of WAL, show the messages just once at the
+			 *  same LSN.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, (uint32) (EndRecPtr >> 32),
+								(uint32) EndRecPtr,	ThisTimeLineID,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f02256ed..9ea1305364 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -682,6 +682,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+
+	if (record->xl_tot_len == 0)
+	{
+		/* This is strictly not an invalid state, so phrase it as so. */
+		report_invalid_record(state,
+							  "record length is 0 at %X/%X",
+							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..da978d4047 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -479,12 +479,15 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-							endofwal = true;
-							break;
+
+							/*
+							 * we have no longer anything to do on the broken
+							 * connection other than exiting.
+							 */
+							proc_exit(1);
 						}
 						len = walrcv_receive(wrconn, &buf, &wait_fd);
 					}
-- 
2.18.2
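
The xlogreader.c hunk in v3 places a zero-length test ahead of the existing "shorter than a record header" test, so that the two cases get different wording. A minimal standalone sketch of that ordering is below. It is not the patch itself: `check_record_length` and `LengthCheck` are hypothetical names, and the header size of 24 is taken from the "wanted 24" in the log lines quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed minimum record size, from "wanted 24" in the quoted log lines. */
#define MIN_RECORD_HEADER_SIZE 24

typedef enum LengthCheck
{
	LENGTH_OK,			/* long enough to hold a record header */
	LENGTH_ZERO,		/* "record length is 0": treated as end of WAL */
	LENGTH_TOO_SHORT	/* "invalid record length": nonzero but too small */
} LengthCheck;

/*
 * Classify a record's total length.  The zero test must come first;
 * otherwise zero would fall into the generic "too short" case and get
 * the scarier wording.
 */
static LengthCheck
check_record_length(uint32_t total_len)
{
	if (total_len == 0)
		return LENGTH_ZERO;
	if (total_len < MIN_RECORD_HEADER_SIZE)
		return LENGTH_TOO_SHORT;
	return LENGTH_OK;
}
```

Only the zero case is reworded; a nonzero length below the header size still reports as invalid, matching the view above that other states imply something is wrong.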

#10

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 6 years ago

In reply to: Kyotaro Horiguchi (#9)

Re: Make mesage at end-of-recovery less scary.

On 2020-03-24 02:52, Kyotaro Horiguchi wrote:

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

The 0-length record is not an "invalid" state during recovery, so we
can add a message for that state, "record length is 0 at %X/%X". I
think that if any other state is found there, it implies something is wrong.

The LSN is shown redundantly, but I'm not sure whether it would be better
to remove it from one of the two lines.

| LOG: reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery
| DETAIL: record length is 0 at 0/3000850

I'm not up to date on all these details, but my high-level idea would be
some kind of hint associated with the existing error messages, like:

HINT: This is to be expected if this is the end of the WAL. Otherwise,
it could indicate corruption.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11

Robert Haas

robertmhaas@gmail.com

almost 6 years ago

In reply to: Peter Eisentraut (#10)

Re: Make mesage at end-of-recovery less scary.

On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

HINT: This is to be expected if this is the end of the WAL. Otherwise,
it could indicate corruption.

First, I agree that this general issue is a problem, because it's come
up for me in quite a number of customer situations. Either people get
scared when they shouldn't, because the message is innocuous, or they
don't get scared about other things that actually are scary, because
if some scary-looking messages are actually innocuous, it can lead
people to believe that the same is true in other cases.

Second, I don't really like the particular formulation you have above,
because the user still doesn't know whether or not to be scared. Can
we figure that out? If we're in crash recovery, I think that
we should not be scared, because we have no alternative to assuming
that we've reached the end of WAL, so all crash recoveries will end
like this. If we're in archive recovery, we should definitely be
scared if we haven't yet reached the minimum recovery point, because
more WAL than that should certainly exist. After that, it depends on
how we got the WAL. If it's being streamed, the question is whether
we've reached the end of what got streamed. If it's being copied from
the archive, we ought to have the whole segment, but maybe not more.
Can we get the right context to the point where the error is being
reported to know whether we hit the error at the end of the WAL that
was streamed? If not, can we somehow rejigger things so that we only
make it sound scary if we keep getting stuck at the same point when we
would've expected to make progress meanwhile?

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.
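
The case analysis above can be written down as a small decision function. This is only a standalone sketch of the reasoning in this message, not code from any patch in the thread; every name in it (Lsn, RecoveryKind, WalSource, EowVerdict, classify_end_of_wal) is hypothetical, and the archive and pg_wal cases are left as "unknown" because the message does not settle them.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for a WAL position */

typedef enum { RECOVERY_CRASH, RECOVERY_ARCHIVE } RecoveryKind;
typedef enum { WAL_FROM_PG_WAL, WAL_FROM_ARCHIVE, WAL_FROM_STREAM } WalSource;
typedef enum { EOW_EXPECTED, EOW_SUSPICIOUS, EOW_UNKNOWN } EowVerdict;

/*
 * Decide how worried to be when reading WAL fails at failed_at.
 * min_recovery_point and streamed_upto are assumed to be known to the
 * caller; whether that context is really available is the open question.
 */
EowVerdict
classify_end_of_wal(RecoveryKind kind, WalSource src, Lsn failed_at,
                    Lsn min_recovery_point, Lsn streamed_upto)
{
    /* Crash recovery always ends this way, so there is nothing to fear. */
    if (kind == RECOVERY_CRASH)
        return EOW_EXPECTED;

    /* WAL up to the minimum recovery point must exist. */
    if (failed_at < min_recovery_point)
        return EOW_SUSPICIOUS;

    /* Streaming: fine only if we consumed everything that was streamed. */
    if (src == WAL_FROM_STREAM)
        return (failed_at >= streamed_upto) ? EOW_EXPECTED : EOW_SUSPICIOUS;

    /* Archive or pg_wal after the minimum recovery point: undecided here. */
    return EOW_UNKNOWN;
}
```

For example, a failure at position 250 while streaming up to 300 would be classified as suspicious, while the same failure during crash recovery would be expected.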

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12

James Coleman

jtc331@gmail.com

almost 6 years ago

In reply to: Robert Haas (#11)

Re: Make mesage at end-of-recovery less scary.

On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

HINT: This is to be expected if this is the end of the WAL. Otherwise,
it could indicate corruption.

First, I agree that this general issue is a problem, because it's come
up for me in quite a number of customer situations. Either people get
scared when they shouldn't, because the message is innocuous, or they
don't get scared about other things that actually are scary, because
if some scary-looking messages are actually innocuous, it can lead
people to believe that the same is true in other cases.

Second, I don't really like the particular formulation you have above,
because the user still doesn't know whether or not to be scared. Can
we figure that out? If we're in crash recovery, I think that
we should not be scared, because we have no alternative to assuming
that we've reached the end of WAL, so all crash recoveries will end
like this. If we're in archive recovery, we should definitely be
scared if we haven't yet reached the minimum recovery point, because
more WAL than that should certainly exist. After that, it depends on
how we got the WAL. If it's being streamed, the question is whether
we've reached the end of what got streamed. If it's being copied from
the archive, we ought to have the whole segment, but maybe not more.
Can we get the right context to the point where the error is being
reported to know whether we hit the error at the end of the WAL that
was streamed? If not, can we somehow rejigger things so that we only
make it sound scary if we keep getting stuck at the same point when we
would've expected to make progress meanwhile?

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.

First, I agree that getting enough context to say precisely is by far the ideal.

That being said, as an end user who's found this surprising -- and
momentarily scary every time I initially scan it even though I *know
intellectually it's not* -- I would find Peter's suggestion a
significant improvement over what we have now. I'm fairly certain my
co-workers on our database team would also. Knowing that something is
at least not always scary is good. Though I'll grant that this does
have the negative in reverse: if it actually is a scary
situation...this mutes your concern level. On the other hand,
monitoring would tell us if there's a real problem (namely replication
lag), so I think the trade-off is clearly worth it.

How about this minor tweak:
HINT: This is expected if this is the end of currently available WAL.
Otherwise, it could indicate corruption.

Thanks,
James

#13

David Steele

david@pgmasters.net

almost 5 years ago

In reply to: James Coleman (#12)

Re: Make mesage at end-of-recovery less scary.

Hi Kyotaro,

On 3/27/20 10:25 PM, James Coleman wrote:

On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.

First, I agree that getting enough context to say precisely is by far the ideal.

That being said, as an end user who's found this surprising -- and
momentarily scary every time I initially scan it even though I *know
intellectually it's not* -- I would find Peter's suggestion a
significant improvement over what we have now. I'm fairly certain my
co-workers on our database team would also. Knowing that something is
at least not always scary is good. Though I'll grant that this does
have the negative in reverse: if it actually is a scary
situation...this mutes your concern level. On the other hand,
monitoring would tell us if there's a real problem (namely replication
lag), so I think the trade-off is clearly worth it.

How about this minor tweak:
HINT: This is expected if this is the end of currently available WAL.
Otherwise, it could indicate corruption.

Any thoughts on the suggestions for making the messaging clearer?

Also, the patch no longer applies:
http://cfbot.cputube.org/patch_32_2490.log.

Marking this Waiting on Author.

Regards,
--
-David
david@pgmasters.net

#14

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: David Steele (#13)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Wed, 3 Mar 2021 11:14:20 -0500, David Steele <david@pgmasters.net> wrote in

Hi Kyotaro,

On 3/27/20 10:25 PM, James Coleman wrote:

On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com>
wrote:

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.

First, I agree that getting enough context to say precisely is by far
the ideal.
That being said, as an end user who's found this surprising -- and
momentarily scary every time I initially scan it even though I *know
intellectually it's not* -- I would find Peter's suggestion a
significant improvement over what we have now. I'm fairly certain my
co-workers on our database team would also. Knowing that something is
at least not always scary is good. Though I'll grant that this does
have the negative in reverse: if it actually is a scary
situation...this mutes your concern level. On the other hand,
monitoring would tell us if there's a real problem (namely replication
lag), so I think the trade-off is clearly worth it.
How about this minor tweak:
HINT: This is expected if this is the end of currently available WAL.
Otherwise, it could indicate corruption.

Any thoughts on the suggestions for making the messaging clearer?

Also, the patch no longer applies:
http://cfbot.cputube.org/patch_32_2490.log.

Sorry for missing the last discussions. I agree with the point about
the really scary situation.

ValidXLogRecordHeader deliberately marks End-Of-WAL only in the case
of a zero-length record so that callers can identify that case
instead of inferring the EOW state without it. All other invalid data
is treated as a potentially dangerous situation. I think this is a
reasonable classification. And the error level for the "danger" cases
is changed to WARNING (from LOG).

As a result, the following messages are emitted with the attached patch.
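
As an illustration only (not the patch itself), the classification described above can be sketched in standalone C. The names SketchReader, SketchLevel, sketch_valid_header and sketch_report_level are made up for this example; the 24-byte minimum is taken from the "wanted 24" log lines, and the real check lives in ValidXLogRecordHeader in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIN_RECORD_LEN 24    /* "wanted 24" in the log messages */

typedef enum { SKETCH_LOG, SKETCH_WARNING } SketchLevel;

typedef struct
{
    bool        end_of_wal;     /* was the last failure a zero-length record? */
    const char *errmsg;         /* NULL when the header looked fine */
} SketchReader;

/* Returns true if the header length is acceptable, false otherwise. */
bool
sketch_valid_header(SketchReader *reader, uint32_t tot_len)
{
    reader->end_of_wal = false;
    reader->errmsg = NULL;

    if (tot_len == 0)
    {
        /* Not strictly invalid: flag it so the caller can tell it apart. */
        reader->errmsg = "record length is 0";
        reader->end_of_wal = true;
        return false;
    }
    if (tot_len < SKETCH_MIN_RECORD_LEN)
    {
        /* Any other bad length is treated as potentially dangerous. */
        reader->errmsg = "invalid record length";
        return false;
    }
    return true;
}

/* End-of-WAL is reported calmly; everything else gets a WARNING. */
SketchLevel
sketch_report_level(const SketchReader *reader)
{
    return reader->end_of_wal ? SKETCH_LOG : SKETCH_WARNING;
}
```

The point of the flag is that the caller does not have to parse the message text to decide between the two cases.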

- found zero-length record during recovery (the DETAIL might not be needed.)

LOG: redo starts at 0/14000118
LOG: reached end of WAL at 0/14C5D070 on timeline 1 in pg_wal during crash recovery
DETAIL: record length is 0 at 0/14C5D070
LOG: redo done at 0/14C5CF48 system usage: ...

- found another kind of invalid data

LOG: redo starts at 0/150000A0
WARNING: invalid record length at 0/1500CA60: wanted 24, got 54
LOG: redo done at 0/1500CA28 system usage: ...

While checking the patch, I found that it emits the following log
lines when the redo loop meets an invalid record at the
start:

LOG: invalid record length at 0/10000118: wanted 24, got 42
LOG: redo is not required

which doesn't look proper. That case is identifiable using the
End-of-WAL flag this patch adds. Thus we get the following error
messages.

- found end-of-wal at the beginning of recovery

LOG: reached end of WAL at 0/130000A0 on timeline 1 in pg_wal during crash recovery
DETAIL: record length is 0 at 0/130000A0
LOG: redo is not required

- found invalid data

WARNING: invalid record length at 0/120000A0: wanted 24, got 42
WARNING: redo is skipped
HINT: This suggests WAL file corruption. You might need to check the database.

The logic of ErrRecPtr in ReadRecord may be wrong. I remember having
such a discussion before...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v4-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From f89b5c965d0a49de3c1297bb5edd2dc061951b71 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v4] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c       | 81 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c | 11 ++++
 src/backend/replication/walreceiver.c   | 13 ++--
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 79 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 377afb8732..fbcb8d78b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4361,12 +4361,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -4374,13 +4377,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we met other than end-of-wal, emit the error message right
+			 * now. Otherwise the message if any is shown as a part of the
+			 * end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4411,11 +4413,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4428,11 +4431,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 ThisTimeLineID,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4480,12 +4489,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  We reached the end of WAL, show the messages just once at the
+			 *  same LSN.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(EndRecPtr), ThisTimeLineID,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7227,7 +7257,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false);
+			record = ReadRecord(xlogreader, WARNING, false);
 		}
 
 		if (record != NULL)
@@ -7454,7 +7484,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false);
+				record = ReadRecord(xlogreader, WARNING, false);
 			} while (record != NULL);
 
 			/*
@@ -7514,13 +7544,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -12653,7 +12690,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (readSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..dacba32143 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -118,6 +118,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -288,6 +289,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 
@@ -689,6 +691,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/* This is strictly not an invalid state, so phrase it as so. */
+		report_invalid_record(state,
+							  "record length is 0 at %X/%X",
+							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
+		state->EndOfWAL = true;
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06fea..2377c58b4c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -460,12 +460,15 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
-							endofwal = true;
-							break;
+								LSN_FORMAT_ARGS(LogstreamResult.Write))));
+
+							/*
+							 * we have no longer anything to do on the broken
+							 * connection other than exiting.
+							 */
+							proc_exit(1);
 						}
 						len = walrcv_receive(wrconn, &buf, &wait_fd);
 					}
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..0491adfc5b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 
 	/* ----------------------------------------
-- 
2.27.0

#15

Bossart, Nathan

bossartn@amazon.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#14)

Re: Make mesage at end-of-recovery less scary.

On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:

As a result, the following messages are emitted with the attached patch.

I'd like to voice my support for this effort, and I intend to help
review the patch. It looks like the latest patch no longer applies,
so I've marked the commitfest entry [0] as waiting-on-author.

Nathan

[0]: https://commitfest.postgresql.org/35/2490/

#16

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Bossart, Nathan (#15)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Fri, 22 Oct 2021 17:54:40 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in

On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:

As a result, the following messages are emitted with the attached patch.

I'd like to voice my support for this effort, and I intend to help
review the patch. It looks like the latest patch no longer applies,
so I've marked the commitfest entry [0] as waiting-on-author.

Nathan

[0] https://commitfest.postgresql.org/35/2490/

Sorry for being late to reply. I rebased this to the current master.

- rebased

- use LSN_FORMAT_ARGS instead of bare shift and mask.

- v4 immediately exited walreceiver on disconnection. Maybe I wanted
not to see a FATAL message on the standby after the primary dies. However,
that would be another issue and that change was plain wrong. v5
just removes the "end-of-WAL" part from the message, which duplicates
what startup emits.

- add a new error message "missing contrecord at %X/%X". Maybe this
should be regarded as a leftover of the contrecord patch. In the
attached patch the "%X/%X" is the LSN of the current record. The log
messages look like this (026_overwrite_contrecord).

LOG: redo starts at 0/1486CB8
WARNING: missing contrecord at 0/1FFC2E0
LOG: consistent recovery state reached at 0/1FFC2E0
LOG: started streaming WAL from primary at 0/2000000 on timeline 1
LOG: successfully skipped missing contrecord at 0/1FFC2E0, overwritten at 2021-11-08 14:50:11.969952+09
CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC2E0; time 2021-11-08 14:50:11.969952+09
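
For readers unfamiliar with the "%X/%X" notation in these messages, here is a rough standalone sketch of what the second bullet above refers to: the 64-bit LSN is split into its high and low 32-bit halves for printing. SKETCH_LSN_FORMAT_ARGS and sketch_format_lsn are names invented for this example; the real LSN_FORMAT_ARGS macro is used in the patch, and its exact definition is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Expands to two arguments: the high and the low 32 bits of the LSN. */
#define SKETCH_LSN_FORMAT_ARGS(lsn) \
    ((unsigned int) ((lsn) >> 32)), ((unsigned int) ((lsn) & 0xFFFFFFFFu))

/* Writes the LSN as "high/low" in hexadecimal, as in the log lines. */
int
sketch_format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
    return snprintf(buf, buflen, "%X/%X", SKETCH_LSN_FORMAT_ARGS(lsn));
}
```

Using a macro keeps the two halves together at each call site, instead of repeating the shift and the cast by hand as v4 did in walreceiver.c.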

While checking the behavior for the case of missing-contrecord, I
noticed that emode_for_corrupt_record() doesn't work as expected since
readSource is reset to XLOG_FROM_ANY after a read failure. We could
remember the last failed source, but pg_wal should have been visited if
a page read error happened, so I changed the function so that it treats
XLOG_FROM_ANY the same way as XLOG_FROM_PG_WAL.

(Otherwise we see "LOG: reached end-of-WAL at .." message after
"WARNING: missing contrecord at.." message.)

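The suppression behavior discussed above can be sketched outside the server as follows. This is a simplified model of the emode_for_corrupt_record() change in the attached patch, not the function itself: the enum names and values are invented (only the ordering DEBUG1 < LOG < WARNING is assumed), and the "last complaint" position is passed explicitly instead of being a static variable.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SK_DEBUG1 = 1, SK_LOG = 2, SK_WARNING = 3, SK_ERROR = 4 } SkLevel;
typedef enum { SK_FROM_ANY, SK_FROM_ARCHIVE, SK_FROM_PG_WAL, SK_FROM_STREAM } SkSource;

/*
 * Demote a repeated complaint about the same position to DEBUG1.
 * SK_FROM_ANY is handled like SK_FROM_PG_WAL, which is the v5 change.
 */
SkLevel
sk_emode_for_corrupt_record(uint64_t *last_complaint, SkSource source,
                            SkLevel emode, uint64_t rec_ptr)
{
    if ((source == SK_FROM_PG_WAL || source == SK_FROM_ANY) &&
        emode <= SK_WARNING)
    {
        if (rec_ptr == *last_complaint)
            emode = SK_DEBUG1;          /* already reported at this position */
        else
            *last_complaint = rec_ptr;  /* remember the first report */
    }
    return emode;
}
```

In this model, a WARNING at 0/1FFC2E0 followed by a LOG request for the same position with the source reset to "any" comes back as DEBUG1, so the second message is not shown at the default log level.
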
regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v5-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From 276f59c8b37a31cb831b7753d2b107eb1d83c1fb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v5] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c       | 94 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 14 ++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 87 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f..623fb01d0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
@@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  If we haven't emit an error message, we have safely reached the
+			 *  end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7582,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+			record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
 		}
 
 		if (record != NULL)
@@ -7781,7 +7819,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+				record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
 			} while (record != NULL);
 
 			/*
@@ -7841,13 +7879,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13135,7 +13180,9 @@ XLogShutdownWalRcv(void)
  * reading from pg_wal, because we don't expect any invalid records in archive
  * or in records streamed from the primary. Files in the archive should be complete,
  * and we should never hit the end of WAL because we stop and wait for more WAL
- * to arrive before replaying it.
+ * to arrive before replaying it.  When we failed to read a new page,
+ * readSource is reset to XLOG_FROM_ANY. This indicates all sources including
+ * pg_wal was failed. Thus treat that the same way with XLOG_FROM_PG_WAL.
  *
  * NOTE: This function remembers the RecPtr value it was last called with,
  * to suppress repeated messages about the same record. Only call this when
@@ -13147,7 +13194,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if ((readSource == XLOG_FROM_PG_WAL || readSource == XLOG_FROM_ANY)
+		&& emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..df2198e862 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,9 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+		report_invalid_record(state,
+							  "missing contrecord at %X/%X",
+							  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +735,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/* This is strictly not an invalid state, so phrase it as so. */
+		report_invalid_record(state,
+							  "record length is 0 at %X/%X",
+							  LSN_FORMAT_ARGS(RecPtr));
+		state->EndOfWAL = true;
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0

#17

Michael Paquier

michael@paquier.xyz

about 4 years ago

In reply to: Kyotaro Horiguchi (#16)

Re: Make mesage at end-of-recovery less scary.

On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote:

While checking the behavior for the case of missing-contrecord, I
noticed that emode_for_corrupt_record() doesn't work as expected since
readSource is reset to XLOG_FROM_ANY after a read failure. We could
remember the last failed source, but pg_wal should have been visited if
a page read error happened, so I changed the function so that it treats
XLOG_FROM_ANY the same way as XLOG_FROM_PG_WAL.

FWIW, I am not much of a fan of assuming that it is fine to use
XLOG_FROM_ANY as a condition here. The comments on top of
emode_for_corrupt_record() make it rather clear what the expectations
are, and this is the default readSource.

(Otherwise we see "LOG: reached end-of-WAL at .." message after
"WARNING: missing contrecord at.." message.)

+      /* broken record found */
+      ereport(WARNING,
+                      (errmsg("redo is skipped"),
+                       errhint("This suggests WAL file corruption. You might need to check the database.")));
This looks rather scary to me, FWIW, and this could easily be reached
if one forgets about EndOfWAL while hacking on xlogreader.c.
Unlikely so, still.

+       report_invalid_record(state,
+                             "missing contrecord at %X/%X",
+                             LSN_FORMAT_ARGS(RecPtr));
Isn't there a risk here of breaking applications that check for error
messages stored in the WAL reader after seeing a contrecord?

+   if (record->xl_tot_len == 0)
+   {
+       /* This is strictly not an invalid state, so phrase it as so. */
+       report_invalid_record(state,
+                             "record length is 0 at %X/%X",
+                             LSN_FORMAT_ARGS(RecPtr));
+       state->EndOfWAL = true;
+       return false;
+   }
This assumes that a value of 0 for xl_tot_len is a synonym of the end
of WAL, but can't we also have a corrupted record whose xl_tot_len
happens to be 0?  We validate the full record after
reading the header, so it seems to me that we should not assume that
things are just ending as proposed in this patch.
--
Michael

#18

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Michael Paquier (#17)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Thank you for the comments!

At Tue, 9 Nov 2021 09:53:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote:

While checking the behavior for the case of missing-contrecord, I
noticed that emode_for_corrupt_record() doesn't work as expected since
readSource is reset to XLOG_FROM_ANY after a read failure. We could
remember the last failed source but pg_wal should have been visited if
page read error happened so I changed the function so that it treats
XLOG_FROM_ANY the same way with XLOG_FROM_PG_WAL.

FWIW, I am not much of a fan of assuming that it is fine to use
XLOG_FROM_ANY as a condition here. The comments on top of
emode_for_corrupt_record() make it rather clear what the expectations
are, and this is the default readSource.

The function expects readSource to be the source that failed, but
readSource goes back to XLOG_FROM_ANY on a page read failure. So the
function *is* standing on a wrong assumption. I had forgotten that
currentSource holds the last accessed source, which is exactly what
we need here. By using it, we no longer need to introduce the unclear
assumption.
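
To make the role of the source check concrete, here is a minimal sketch of the demotion logic, assuming the shape shown in the attached patch. It is not the patch itself: the Sk-prefixed names are hypothetical stand-ins, the numeric levels only mirror the ordering DEBUG1 < LOG < WARNING, and the last-complaint LSN is passed in explicitly rather than kept in a static variable so that the sketch is easy to exercise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in error levels; only their relative order matters here. */
enum { SK_DEBUG1 = 14, SK_LOG = 15, SK_WARNING = 19 };

/* Stand-in for the WAL source enum (hypothetical names). */
typedef enum
{
	SK_FROM_ANY,
	SK_FROM_ARCHIVE,
	SK_FROM_PG_WAL,
	SK_FROM_STREAM
} SkSource;

/*
 * Demote a repeated complaint about the same LSN to DEBUG1, but only when
 * the last accessed source was pg_wal and the level is WARNING or lower.
 */
static int
sk_emode_for_corrupt_record(int emode, SkSource source,
							uint64_t rec_ptr, uint64_t *last_complaint)
{
	assert(last_complaint != NULL);

	if (source == SK_FROM_PG_WAL && emode <= SK_WARNING)
	{
		if (rec_ptr == *last_complaint)
			emode = SK_DEBUG1;			/* already complained about this LSN */
		else
			*last_complaint = rec_ptr;	/* remember the first complaint */
	}
	return emode;
}
```

With this shape, passing SK_FROM_ANY never demotes anything, which is why the function needs the source that was actually accessed last.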

(Otherwise we see "LOG: reached end-of-WAL at .." message after
"WARNING: missing contrecord at.." message.)

+      /* broken record found */
+      ereport(WARNING,
+                      (errmsg("redo is skipped"),
+                       errhint("This suggests WAL file corruption. You might need to check the database.")));
This looks rather scary to me, FWIW, and this could easily be reached

Yes, the message is intentionally scary, since we don't come here in
the case of clean WAL:)

if one forgets about EndOfWAL while hacking on xlogreader.c.
Unlikely so, still.

I don't understand. Isn't that the case for almost every feature?

The patch compels hackers to maintain the condition under which
recovery is considered to have ended cleanly. If the last record
doesn't meet the condition, the WAL file should be considered to have
a problem. However, I don't expect the condition to gain another
term in the future.

Even if someone, including myself, broke that condition, at worst we
would see an unwanted scary message. And I believe almost all
hackers could easily identify it as a bug from the DETAIL message
shown alongside. I'm not sure such bugs would be found in the
development phase, though..

+       report_invalid_record(state,
+                             "missing contrecord at %X/%X",
+                             LSN_FORMAT_ARGS(RecPtr));
Isn't there a risk here of breaking applications that check for error
messages stored in the WAL reader after seeing a contrecord?

I'm not sure whether you mean the case where no message was stored
previously or the case where a message is already stored. The former
is fine, as the record is actually broken. But I was missing the latter
case. In this version I avoid overwriting the error message.

+   if (record->xl_tot_len == 0)
+   {
+       /* This is strictly not an invalid state, so phrase it as so. */
+       report_invalid_record(state,
+                             "record length is 0 at %X/%X",
+                             LSN_FORMAT_ARGS(RecPtr));
+       state->EndOfWAL = true;
+       return false;
+   }
This assumes that a value of 0 for xl_tot_len is a synonym of the end
of WAL, but can't we also have a corrupted record whose xl_tot_len
happens to be 0?  We validate the full record after
reading the header, so it seems to me that we should not assume that
things are just ending as proposed in this patch.

Yeah, that's the most serious concern to me. That is why I didn't hide
the detailed message in the "reached end of WAL" message.

LOG: reached end of WAL at 0/512F198 on timeline 1 in pg_wal during crash recovery
DETAIL: record length is 0 at 0/512F210

I believe everyone regards a zero record length as fine unless
something wrong is seen afterwards. However, we can extend the check
to the whole record header. I think that gets much closer to perfect
for almost all cases. The attached patch emits the following message
for the good (true end-of-WAL) case.

LOG: reached end of WAL at 0/512F4A0 on timeline 1 in pg_wal during crash recovery
DETAIL: empty record header found at 0/512F518

If garbage bytes are found in the header area, the following log lines
are emitted instead. I think we can find a better message here.

WARNING: garbage record header at 0/2095458
LOG: redo done at 0/2095430 system usage: CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.04 s
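
The LSNs in the log lines above are printed as two hexadecimal halves separated by a slash. As a rough illustration only, the sketch below builds the "reached end of WAL" line with plain snprintf(), splitting a 64-bit LSN into its high and low 32 bits the way the %X/%X arguments are supplied in the patch. The function name and its parameters are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the end-of-WAL message into buf.  Returns true if the whole
 * message fit, false if it was truncated or formatting failed.
 */
static bool
sk_format_end_of_wal(char *buf, size_t buflen, uint64_t lsn,
					 unsigned int tli, const char *source, const char *phase)
{
	int			n;

	assert(buf != NULL && buflen > 0);

	n = snprintf(buf, buflen,
				 "reached end of WAL at %X/%X on timeline %u in %s during %s",
				 (unsigned int) (lsn >> 32),			/* high 32 bits */
				 (unsigned int) (lsn & 0xFFFFFFFFu),	/* low 32 bits */
				 tli, source, phase);

	return n >= 0 && (size_t) n < buflen;
}
```

Calling it with an LSN of 0x512F4A0, timeline 1, "pg_wal" and "crash recovery" yields the same text as the LOG line for the good case.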

This is the updated version.

- emode_for_corrupt_record() now uses currentSource instead of
readSource.

- If a zero record length is encountered, make sure the whole header
is zeroed before treating it as the end-of-WAL.

- Do not overwrite an existing message when a missing contrecord is
detected. The message added here is seen in the TAP test log
026_overwrite_contrecord_standby.log
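
As a minimal sketch of the second item above (not the patch itself), the check can be thought of as a three-way classification of the header bytes. Everything here is a stand-in chosen for illustration: SKETCH_HEADER_SIZE plays the role of SizeOfXLogRecord (24 matches the "wanted 24" in the original log message), and HeaderKind and classify_header() are hypothetical names. The sketch assumes the total-length field occupies the first four bytes of the header, and it tests the loop bound before reading each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SizeOfXLogRecord. */
#define SKETCH_HEADER_SIZE 24

typedef enum
{
	HEADER_HAS_LENGTH,		/* nonzero length: normal validation continues */
	HEADER_EMPTY,			/* every byte is zero: treat as end of WAL */
	HEADER_GARBAGE			/* zero length but other bytes set */
} HeaderKind;

/* Classify a header buffer of SKETCH_HEADER_SIZE bytes. */
static HeaderKind
classify_header(const unsigned char *hdr)
{
	uint32_t	tot_len;
	size_t		i;

	assert(hdr != NULL);

	/* Assumption: the length field is the first four bytes. */
	memcpy(&tot_len, hdr, sizeof(tot_len));
	if (tot_len != 0)
		return HEADER_HAS_LENGTH;

	/* Zero length: only a fully zeroed header counts as end of WAL. */
	for (i = 0; i < SKETCH_HEADER_SIZE; i++)
	{
		if (hdr[i] != 0)
			return HEADER_GARBAGE;
	}
	return HEADER_EMPTY;
}
```

Only HEADER_EMPTY would correspond to setting EndOfWAL; HEADER_GARBAGE corresponds to the "garbage record header" warning.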

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v6-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 1d5f6e707f8d67172eea79689c8a5f4d86889d3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v6] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length", which suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead, and raise the error level for other kinds of WAL failure to
WARNING.
---
 src/backend/access/transam/xlog.c       | 89 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 42 ++++++++++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 111 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f..e90c69810b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
@@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  If we haven't emitted an error message, we have safely reached the
+			 *  end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7582,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+			record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
 		}
 
 		if (record != NULL)
@@ -7781,7 +7819,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+				record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
 			} while (record != NULL);
 
 			/*
@@ -7841,13 +7879,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13147,7 +13192,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..273b927cd9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,16 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message, as
+		 * it should be more detailed.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the whole
+		 * header is zeroed.
+		 */
+		char   *p = (char *)record;
+		char   *pe = (char *)record + SizeOfXLogRecord;
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/* it is completely zeroed, call it a day  */
+			report_invalid_record(state, "empty record header found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+		}
+		else
+		{
+			/* Otherwise we found a garbage header.. */
+			report_invalid_record(state, "garbage record header at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+		}
+
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0

#19

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 4 years ago

In reply to: Kyotaro Horiguchi (#18)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Tue, 09 Nov 2021 16:27:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This is the updated version.

- emode_for_corrupt_record() now uses currentSource instead of
readSource.

- If a zero record length is encountered, make sure the whole header
is zeroed before treating it as the end-of-WAL.

- Do not overwrite an existing message when a missing contrecord is
detected. The message added here is seen in the TAP test log
026_overwrite_contrecord_standby.log

d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables
ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs
directly from xlogreader instead of the removed global variables.
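
For readers skimming the diff, the way the rebased patch picks the LSN to report can be sketched roughly as follows. This is an illustration under assumptions, not the patch code: SkReader and sk_error_lsn() are hypothetical names, 0 stands in for InvalidXLogRecPtr, and the test on the aborted-record pointer is assumed because that condition is not visible in the diff context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_INVALID_LSN UINT64_C(0)	/* stand-in for InvalidXLogRecPtr */

/* Hypothetical subset of the reader state that the choice depends on. */
typedef struct
{
	uint64_t	read_rec_ptr;		/* start of last record read */
	uint64_t	end_rec_ptr;		/* end+1 of last record read */
	uint64_t	aborted_rec_ptr;	/* start of a partial record, if any */
} SkReader;

/* Pick the LSN to mention in the error or end-of-WAL message. */
static uint64_t
sk_error_lsn(const SkReader *r)
{
	assert(r != NULL);

	/* Assumed condition: a partial record was detected. */
	if (r->aborted_rec_ptr != SK_INVALID_LSN)
		return r->aborted_rec_ptr;

	/*
	 * An invalid read pointer means no record could be read at all; the
	 * end pointer then holds the LSN of the record that was attempted.
	 */
	return r->read_rec_ptr != SK_INVALID_LSN ? r->read_rec_ptr : r->end_rec_ptr;
}
```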

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v7-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From cc521692a9f98fabde07e248b63f1222f8406de1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v7] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length", which suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead, and raise the error level for other kinds of WAL failure to
WARNING.
---
 src/backend/access/transam/xlog.c       | 89 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 42 ++++++++++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..fa435faec4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4469,6 +4469,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4484,6 +4485,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4493,12 +4506,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
+			if (!xlogreader->EndOfWAL && errormsg)
 				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
@@ -4530,11 +4542,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4547,11 +4560,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4599,12 +4618,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  If we haven't emitted an error message, we have safely reached the
+			 *  end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7536,7 +7576,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7774,7 +7814,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7834,13 +7874,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13130,7 +13177,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a7de02565..e16b6fe041 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,16 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message, as
+		 * it should be more detailed.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the whole
+		 * header is zeroed.
+		 */
+		char   *p = (char *)record;
+		char   *pe = (char *)record + SizeOfXLogRecord;
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/* it is completely zeroed, call it a day  */
+			report_invalid_record(state, "empty record header found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+		}
+		else
+		{
+			/* Otherwise we found a garbage header.. */
+			report_invalid_record(state, "garbage record header at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+		}
+
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0

#20

Pavel Borisov

pashkin.elfe@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#19)

Re: Make mesage at end-of-recovery less scary.

d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables
ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs
directly from xlogreader instead of the removed global variables.

Hi, hackers!

I've checked the latest version of the patch. It applies cleanly, check-world
passes and CI is also green.
The proposed messages seem good to me, but it would probably be better to have
a test for the conditions under which "reached end of WAL..." is emitted.
Then, I believe it can be set as 'ready for committer'.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

#21

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Pavel Borisov (#20)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Mon, 24 Jan 2022 14:23:33 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in

d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables
ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs
directly from xlogreader instead of the removed global variables.

Hi, hackers!

I've checked the latest version of the patch. It applies cleanly, check-world
passes and CI is also green.
The proposed messages seem good to me, but it would probably be better to have
a test for the conditions under which "reached end of WAL..." is emitted.
Then, I believe it can be set as 'ready for committer'.

Thanks for checking that, and the comment!

I thought that we usually don't test log messages, but in the end I
found that I needed the tests. That is because, along the way, I found
another mode of end-of-WAL and a bug that emits a spurious message...

The changes in v8 are:

- Added tests to 011_crash_recovery.pl

- Fixed a bug where the server emits "end-of-wal" messages even if it
has emitted an error message for the same LSN.

- Changed XLogReaderValidatePageHeader() so that it recognizes an
empty page as end-of-WAL.

- Made pg_waldump aware of end-of-WAL.

While doing the last item, I noticed that pg_waldump shows the wrong
LSN as the error position. Concretely, it emits the LSN of the last
sound WAL record as the error position. I will post a bug-fix patch
for the issue after confirmation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v8-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 0f1024bdfba9d1926465351fa1b7125698a21e8d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v8] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length", which suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead, and raise the error level for other kinds of WAL failure to
WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  64 +++++++++++++
 src/backend/replication/walreceiver.c     |   3 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 254 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 58922f7ede..c08b9554b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at the
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  If we haven't emit an error message, we have safely reached the
+			 *  end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..55f54cd98d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the whole
+		 * header is zeroed.
+		 */
+		char   *p = (char *)record;
+		char   *pe = (char *)record + SizeOfXLogRecord;
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/* it is completely zeroed, call it a day  */
+			report_invalid_record(state, "empty record header found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+		}
+		else
+		{
+			/* Otherwise the header is corrupted. */
+			report_invalid_record(state, "garbage record header at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+		}
+
+		return false;
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +877,29 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int	i;
+
+		for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..3034f8281e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..b793280a5c 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  garbage record header at 0/$lastlsn",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#22

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#21)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Tue, 25 Jan 2022 17:34:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This v8 is changed in...

- Added tests to 011_crash_recovery.pl

- Fixed a bug where the server emits "end-of-wal" messages even if it
has already emitted an error message for the same LSN.

- Changed XLogReaderValidatePageHeader() so that it recognizes an
empty page as end-of-WAL.

- Made pg_waldump conscious of end-of-wal.

While doing the last item, I noticed that pg_waldump shows the wrong
LSN as the error position. Concretely it emits the LSN of the last
sound WAL record as the error position. I will post a bug-fix patch
for the issue after confirmation.

I noticed that the error message "garbage record header" I added is
unnecessary, since that case is just a kind of invalid record length,
so I removed it. That change makes the logic for EOW in
ValidXLogRecordHeader and XLogReaderValidatePageHeader share the same
flow.
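
The shared flow described above can be sketched outside the patch as follows. This is only an illustration under assumptions: HeaderCheck, all_zeroes and check_header are hypothetical names that do not exist in PostgreSQL, and "key" stands for whichever field is examined first (xl_tot_len in ValidXLogRecordHeader, xlp_magic in XLogReaderValidatePageHeader). A zero key with an all-zero region is reported as end-of-WAL, and anything else falls through to the ordinary validity check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of validating a header; not a PostgreSQL type. */
typedef enum
{
	HDR_VALID,
	HDR_END_OF_WAL,
	HDR_INVALID
} HeaderCheck;

/* Return true if all len bytes starting at buf are zero. */
static bool
all_zeroes(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		if (buf[i] != 0)
			return false;
	}
	return true;
}

/*
 * key          - the field examined first (record length or page magic)
 * key_is_valid - result of the ordinary validity check for that field
 * buf, len     - region that must be entirely zero to count as end-of-WAL
 */
static HeaderCheck
check_header(uint32_t key, bool key_is_valid,
			 const unsigned char *buf, size_t len)
{
	if (key == 0 && all_zeroes(buf, len))
		return HDR_END_OF_WAL;

	/* A zero key with stray bytes falls through to the ordinary check. */
	return key_is_valid ? HDR_VALID : HDR_INVALID;
}
```

With this shape no separate "garbage" message is needed: a header whose key is zero but whose region is not all zeroes is reported by the same branch that reports any other invalid value.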

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v9-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From 57cb251f7cacbb96066ead4543b9f12f5b3c7062 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v9] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  61 ++++++++++++
 src/backend/replication/walreceiver.c     |   3 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 251 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..5727e0939f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at the
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  If we haven't emit an error message, we have safely reached the
+			 *  end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..418fb66ef2 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,31 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the whole
+		 * header is filled with zeroes.
+		 */
+		char   *p = (char *)record;
+		char   *pe = (char *)record + SizeOfXLogRecord;
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/* it is completely zeroed, call it a day  */
+			report_invalid_record(state, "empty record header found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +872,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int	i;
+
+		for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..3034f8281e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#23

Pavel Borisov

pashkin.elfe@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#22)

Re: Make mesage at end-of-recovery less scary.

This v8 is changed in...

- Added tests to 011_crash_recovery.pl

- Fixed a bug where the server emits "end-of-wal" messages even if it
has already emitted an error message for the same LSN.

- Changed XLogReaderValidatePageHeader() so that it recognizes an
empty page as end-of-WAL.

- Made pg_waldump conscious of end-of-wal.

While doing the last item, I noticed that pg_waldump shows the wrong
LSN as the error position. Concretely it emits the LSN of the last
sound WAL record as the error position. I will post a bug-fix patch
for the issue after confirmation.

I noticed that the error message "garbage record header" I added is
unnecessary, since that case is just a kind of invalid record length,
so I removed it. That change makes the logic for EOW in
ValidXLogRecordHeader and XLogReaderValidatePageHeader share the same
flow.

Hi, Kyotaro!

I don't quite understand the meaning of this comment:
/* it is completely zeroed, call it a day */

Please also run pgindent on your code.

Otherwise the new patch seems ok.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

#24

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Pavel Borisov (#23)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Hi, Pavel.

At Mon, 31 Jan 2022 15:17:09 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in

I don't quite understand the meaning of this comment:
/* it is completely zeroed, call it a day */

While rethinking this comment, it occurred to me that
XLogReaderValidatePageHeader does a whole-page check, and there is no
clear reason not to do at least the same check here. So I changed
ValidXLogRecordHeader to check all bytes in the rest of the page,
instead of just the record header.
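
As a rough sketch of that "rest of the page" check (not the patch itself): SKETCH_BLCKSZ stands in for XLOG_BLCKSZ, and bytes_left_in_page and rest_of_page_is_zero are hypothetical helper names. It assumes the block size is a power of two and that the caller holds the whole page in memory starting from the record position.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* stand-in for XLOG_BLCKSZ, a power of two */

/* Number of bytes from lsn up to the end of the page that contains it. */
static size_t
bytes_left_in_page(uint64_t lsn)
{
	return SKETCH_BLCKSZ - (size_t) (lsn & (SKETCH_BLCKSZ - 1));
}

/*
 * rec points at the in-memory copy of the record that starts at lsn.  The
 * caller guarantees that the rest of that page is readable from rec.
 */
static bool
rest_of_page_is_zero(const char *rec, uint64_t lsn)
{
	const char *p = rec;
	const char *pe = rec + bytes_left_in_page(lsn);

	/* Test the bound first so that the byte at pe is never read. */
	while (p < pe && *p == 0)
		p++;

	return p == pe;
}
```

For example, with an 8192-byte page a record starting at 0/1551048 sits at page offset 0x1048, so 8192 - 0x1048 = 4024 bytes would be scanned.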

While working on that, I noticed another end-of-WAL case: an
unexpected pageaddr. I think we can regard it as safe when the pageaddr
is smaller than expected (or rather, we have no choice but to assume
so). I changed XLogReaderValidatePageHeader that way, but I'm not sure
others would regard it as a form of safe end-of-WAL.
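
The pageaddr rule proposed here can be written down as a small classification. This is a sketch of the idea as stated in this message, not of the v10 code: PageAddrCheck and classify_page_addr are hypothetical names, and treating a smaller address as end-of-WAL rests on the assumption that such a page is a leftover from an earlier use of the segment file.

```c
#include <stdint.h>

/* Hypothetical result type; not part of PostgreSQL. */
typedef enum
{
	PAGE_ADDR_OK,
	PAGE_ADDR_END_OF_WAL,
	PAGE_ADDR_CORRUPT
} PageAddrCheck;

/*
 * expected - the page address the reader computed for this position
 * found    - the page address stored in the page header
 */
static PageAddrCheck
classify_page_addr(uint64_t expected, uint64_t found)
{
	if (found == expected)
		return PAGE_ADDR_OK;

	/* An older address is assumed to be stale content, hence end-of-WAL. */
	if (found < expected)
		return PAGE_ADDR_END_OF_WAL;

	/* A newer address than expected cannot be explained that way. */
	return PAGE_ADDR_CORRUPT;
}
```

Whether the "smaller" case is really safe is exactly the open question raised above, so the middle branch is the part reviewers would need to agree on.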

Please also run pgindent on your code.

Hmm. I'm not sure we need to do that at this stage. pgindent makes
changes across the whole file, including parts unrelated to this patch.
Anyway, I ran it and then removed the irrelevant edits.

pgindent made one suggestion that does not look great:

+		char	   *pe =
+		(char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));

I'm not sure whether this is intended, but I split it into a separate
declaration and assignment.

Otherwise the new patch seems ok.

Thanks!

Version 10 differs in the following points.

- Rewrote the comment in ValidXLogRecordHeader.
- ValidXLogRecordHeader

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v10-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 9eacdd050a8041b358df11ca3e18c1071b693d20 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v10] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  77 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 269 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..378c13ccf7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at the
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error
+			 * message right now. Otherwise the message if any is shown as a
+			 * part of the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * If we haven't emit an error message, we have safely reached the
+			 * end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char	   *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..9bcc4a2d37 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,39 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p = (char *) record;
+		char	   *pe;
+
+		/* set pe to the beginning of the next page */
+		pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +880,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +990,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..8e1fa32489 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+											startpointTLI,
+											LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#25

Pavel Borisov

pashkin.elfe@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#24)

Re: Make mesage at end-of-recovery less scary.

Version 10 differs in the following points.

- Rewrote the comment in ValidXLogRecordHeader.
- ValidXLogRecordHeader

Thanks!

Maybe it can be written a little shorter:
pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
as
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
?

The problem that pgindent sometimes reflows the formatting of unrelated
blocks does indeed exist. But I think it's right to manually keep only
the pgindent changes that are related to the patch. The rest gets
pgindent-ed on a schedule anyway, so there is no need to bother.

I'd like to set v10 as RfC.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

#26

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Pavel Borisov (#25)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in

Maybe it can be written a little shorter:
pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
as
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
?

That difference is a matter of taste, but I find it cleaner when
definition and assignment are separated for both p and pe.
Now it looks like the following.

char *p;
char *pe;

/* scan from the beginning of the record to the end of block */
p = (char *) record;
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
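
For readers unfamiliar with the mask arithmetic in that assignment, here is a small standalone sketch of the same computation on plain integers. DEMO_BLCKSZ and demo_bytes_to_page_end are hypothetical names, and the 8192-byte page size is an assumption; the mask trick only works because the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ 8192u		/* assumed WAL page size, a power of two */

/*
 * Number of bytes from rec_ptr to the end of the page containing it.
 * (rec_ptr & (DEMO_BLCKSZ - 1)) is the offset of rec_ptr within its page.
 */
static uint32_t
demo_bytes_to_page_end(uint64_t rec_ptr)
{
	/* the mask below is only an offset-in-page for power-of-two sizes */
	assert((DEMO_BLCKSZ & (DEMO_BLCKSZ - 1)) == 0);

	return (uint32_t) (DEMO_BLCKSZ - (rec_ptr & (DEMO_BLCKSZ - 1)));
}
```

For example, with the LSN 0/1551048 from the first message of this thread, the offset within the page is 0x1048 (4168), so 8192 - 4168 = 4024 bytes remain to be scanned. A record pointer exactly on a page boundary yields a full page.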

The problem that pgindent sometimes reflows the formatting of unrelated
blocks does indeed exist. But I think it's right to manually keep only
the pgindent changes that are related to the patch. The rest gets
pgindent-ed on a schedule anyway, so there is no need to bother.

Yeah, I meant that it is a bit annoying to un-pgindent unrelated
edits :p

I'd like to set v10 as RfC.

Thanks! The suggested change is done in the attached v11.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v11-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 491416866920f8f9648dee9c0571022f71553879 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v11] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 270 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..378c13ccf7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * NULL ReadRecPtr means we could not read a record at the
+				 * beginning. In that case EndRecPtr is storing the LSN of the
+				 * record we tried to read.
+				 */
+				ErrRecPtr =
+					xlogreader->ReadRecPtr ?
+					xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error
+			 * message right now. Otherwise the message if any is shown as a
+			 * part of the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * If we haven't emit an error message, we have safely reached the
+			 * end-of-WAL.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char	   *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..03a8b42f15 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..8e1fa32489 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+											startpointTLI,
+											LSN_FORMAT_ARGS(LogstreamResult.Write))));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#27

Pavel Borisov

pashkin.elfe@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#26)

Re: Make mesage at end-of-recovery less scary.

Thanks! The suggested change is done in the attached v11.

Thanks! v11 is a small refactoring of v10 that doesn't change behavior, so
it is RfC as well.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>

#28

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#26)

Re: Make mesage at end-of-recovery less scary.

Hi,

Here are some of my review comments on the v11 patch:

-                       (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                       (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                        LSN_FORMAT_ARGS(ErrRecPtr),
+                                        replayTLI,
+                                        xlogSourceNames[currentSource])));

Why crash recovery? Won't this message get printed even during PITR?

I just did a PITR and could see these messages in the logfile.

2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; automatic recovery in progress
2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering archive recovery

+           /*
+            * If we haven't emit an error message, we have safely reached the
+            * end-of-WAL.
+            */
+           if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+           {
+               char       *fmt;
+
+               if (StandbyMode)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+               else if (InArchiveRecovery)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+               else
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+               ereport(LOG,
+                       (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                               xlogSourceNames[currentSource]),
+                        (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+           }

Doesn't it make sense to add an assert statement inside this if-block
that will check for xlogreader->EndOfWAL?

-            * We only end up here without a message when XLogPageRead()
-            * failed - in that case we already logged something. In
-            * StandbyMode that only happens if we have been triggered, so we
-            * shouldn't loop anymore in that case.
+            * If we get here for other than end-of-wal, emit the error
+            * message right now. Otherwise the message if any is shown as a
+            * part of the end-of-WAL message below.
             */

For consistency, I think we can replace "end-of-wal" with
"end-of-WAL". Please note that everywhere else in the comments you
have used "end-of-WAL". So why not the same here?

                            ereport(LOG,
-                                   (errmsg("replication terminated by primary server"),
-                                    errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                              startpointTLI,
-                                              LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                   (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+                                           startpointTLI,
+                                           LSN_FORMAT_ARGS(LogstreamResult.Write))));

Is this change really required? I don't see any issue with the
existing error message.

Lastly, are we also planning to backport this patch?

--
With Regards,
Ashutosh Sharma.

On Wed, Feb 2, 2022 at 11:05 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in

Maybe it can be written a little bit shorter:
pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
as
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
?

That difference is a matter of taste, but I find it cleaner when the
declaration and the assignment are separated for both p and pe.
It now looks like the following.

char *p;
char *pe;

/* scan from the beginning of the record to the end of block */
p = (char *) record;
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
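
The pointer arithmetic above can be tried outside the server with a small
standalone sketch. This is not the patch's code: SKETCH_BLCKSZ stands in for
XLOG_BLCKSZ (assumed to be a power of two), rest_of_block_is_zero is a
hypothetical helper name, and the loop tests the bound before dereferencing,
which is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* stand-in for XLOG_BLCKSZ (assumption) */

/*
 * Return true if every byte from "rec" up to the end of the WAL block that
 * contains "recptr" is zero.  "rec" must point at the in-memory byte that
 * corresponds to the position "recptr".
 */
static bool
rest_of_block_is_zero(const char *rec, uint64_t recptr)
{
	const char *p;
	const char *pe;

	assert(rec != NULL);

	/* scan from the beginning of the record to the end of the block */
	p = rec;
	pe = p + SKETCH_BLCKSZ - (recptr & (SKETCH_BLCKSZ - 1));

	/* check the bound first so that pe itself is never dereferenced */
	while (p < pe && *p == 0)
		p++;

	return p == pe;
}
```

Masking with (SKETCH_BLCKSZ - 1) yields the offset of recptr within its
block, so pe ends up exactly one byte past the end of that block.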

It is true that pgindent sometimes reflows unrelated blocks. But I think
it's right to manually keep only the pgindent changes that are related to
the patch. The rest gets pgindent-ed on a scheduled basis anyway, so there
is no need to bother.

Yeah, I meant that it is a bit annoying to un-pgindent unrelated
edits :p

I'd like to set v10 as RfC.

Thanks! The suggested change is done in the attached v11.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#29

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#28)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Hi, Ashutosh.

At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

Here are some of my review comments on the v11 patch:

Thank you for taking a look at this.

-                       (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                       (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                        LSN_FORMAT_ARGS(ErrRecPtr),
+                                        replayTLI,
+                                        xlogSourceNames[currentSource])));

Why crash recovery? Won't this message get printed even during PITR?

It is in the if-block with the following condition.

* If archive recovery was requested, but we were still doing
* crash recovery, switch to archive recovery and retry using the
* offline archive. We have now replayed all the valid WAL in
* pg_wal, so we are presumably now consistent.

...

if (!InArchiveRecovery && ArchiveRecoveryRequested)

This means archive recovery was requested but has not started yet. That
is, we've just finished crash recovery. The existing comment cited
above says the same.

At the end of PITR (or archive recovery), the other code path is used.

/*
* If we haven't emit an error message, we have safely reached the
* end-of-WAL.
*/
if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
{
char *fmt;

if (StandbyMode)
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
else if (InArchiveRecovery)
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
else
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");

The last of the above messages is chosen when archive recovery is
not requested at all.
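
To make the selection above easier to follow, here is a minimal standalone
sketch of the decision. It is only an illustration under assumptions: the
enum, classify_end_of_wal and eow_phase_text are hypothetical names, and the
three booleans stand in for StandbyMode, InArchiveRecovery and
ArchiveRecoveryRequested.

```c
#include <assert.h>
#include <stdbool.h>

/* which "reached end of WAL" wording applies (names are hypothetical) */
typedef enum
{
	EOW_SWITCH_TO_ARCHIVE,		/* crash recovery done, archive recovery next */
	EOW_STANDBY,
	EOW_ARCHIVE,
	EOW_CRASH
} EowPhase;

static EowPhase
classify_end_of_wal(bool standby_mode, bool in_archive_recovery,
					bool archive_recovery_requested)
{
	/* archive recovery requested but not started: crash recovery just ended */
	if (!in_archive_recovery && archive_recovery_requested)
		return EOW_SWITCH_TO_ARCHIVE;

	if (standby_mode)
		return EOW_STANDBY;
	if (in_archive_recovery)
		return EOW_ARCHIVE;

	/* archive recovery was not requested at all */
	return EOW_CRASH;
}

/* trailing part of the message text for each phase */
static const char *
eow_phase_text(EowPhase phase)
{
	switch (phase)
	{
		case EOW_SWITCH_TO_ARCHIVE:
			return "during crash recovery, entering archive recovery";
		case EOW_STANDBY:
			return "during standby mode";
		case EOW_ARCHIVE:
			return "during archive recovery";
		case EOW_CRASH:
			return "during crash recovery";
	}
	assert(0 && "unknown phase");
	return "";
}
```

In the PITR case from the review, the first branch is the one taken, which
is why the DEBUG1 line mentions crash recovery.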

I just did a PITR and could see these messages in the logfile.

Yeah, the log lines show that the server starts with crash recovery
before running PITR.

2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; automatic recovery in progress

Well, I guess that "automatic recovery" is ambiguous. Would it
make sense if the second line read as follows instead?

+ 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress

2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering archive recovery

(I don't include this change in this patch since it would be another
issue.)

+           /*
+            * If we haven't emit an error message, we have safely reached the
+            * end-of-WAL.
+            */
+           if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+           {
+               char       *fmt;
+
+               if (StandbyMode)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+               else if (InArchiveRecovery)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+               else
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+               ereport(LOG,
+                       (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                               xlogSourceNames[currentSource]),
+                        (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+           }

Doesn't it make sense to add an assert statement inside this if-block
that will check for xlogreader->EndOfWAL?

Good point. On second thought, the condition there is flat wrong.
The message is "reached end of WAL" so the condition should be
EndOfWAL. On the other hand we didn't make sure that the error
message for the stop is emitted anywhere. Thus I don't particularly
want to be strict on that point.

I made the following change for this.

-			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			if (xlogreader->EndOfWAL)
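
As a rough illustration of what switching to the flag means for callers,
the sketch below classifies a failed read. It is not the patch's code:
SketchReadResult, ReportKind and classify_read_failure are hypothetical
names, and the two fields only mirror xlogreader->EndOfWAL and errormsg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* outcome of a failed record read (hypothetical stand-in type) */
typedef struct
{
	bool		end_of_wal;		/* mirrors xlogreader->EndOfWAL */
	const char *errormsg;		/* reader's message, or NULL if none */
} SketchReadResult;

typedef enum
{
	REPORT_NONE,				/* nothing to say */
	REPORT_END_OF_WAL,			/* "reached end of WAL", errormsg as DETAIL */
	REPORT_CORRUPTION			/* errormsg emitted as its own message */
} ReportKind;

static ReportKind
classify_read_failure(const SketchReadResult *r)
{
	assert(r != NULL);

	/* the flag alone decides whether this is a benign end-of-WAL */
	if (r->end_of_wal)
		return REPORT_END_OF_WAL;

	/* any other failure that carries a message is reported as an error */
	if (r->errormsg != NULL)
		return REPORT_CORRUPTION;

	return REPORT_NONE;
}
```

With this shape the end-of-WAL report no longer depends on which log level
happened to be chosen for the error message.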

-            * We only end up here without a message when XLogPageRead()
-            * failed - in that case we already logged something. In
-            * StandbyMode that only happens if we have been triggered, so we
-            * shouldn't loop anymore in that case.
+            * If we get here for other than end-of-wal, emit the error
+            * message right now. Otherwise the message if any is shown as a
+            * part of the end-of-WAL message below.
*/

For consistency, I think we can replace "end-of-wal" with
"end-of-WAL". Please note that everywhere else in the comments you
have used "end-of-WAL". So why not the same here?

Right. Fixed.

ereport(LOG,
-                                   (errmsg("replication terminated by primary server"),
-                                    errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                              startpointTLI,
-                                              LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                   (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+                                           startpointTLI,
+                                           LSN_FORMAT_ARGS(LogstreamResult.Write))));

Is this change really required? I don't see any issue with the
existing error message.

Without the change, we see two similar end-of-WAL messages, one from
walreceiver and one from startup. (Please ignore the slight
difference in LSNs.)

[walreceiver] LOG: replication terminated by primary server
[walreceiver] DETAIL: End of WAL reached on timeline 1 at 0/B0000D8.
[startup] LOG: reached end of WAL at 0/B000060 on timeline 1 in archive during standby mode
[startup] DETAIL: empty record found at 0/B0000D8

But what the walreceiver detected at that time is not end-of-WAL but an
error on the streaming connection. Since this patch makes the startup
process detect end-of-WAL, we don't need the duplicate and, in a
sense, false end-of-WAL message from walreceiver.

# By the way, I had deliberately chosen to report the LSN of the last
# successfully read record in the "reached end of WAL" message. On
# second thought, I came to think that it is better to report the
# failure LSN, so I changed it to do that. In this case we face an
# ambiguity depending on how we failed to read the record, but for now
# we have no choice but to blindly choose one of them. I chose
# EndRecPtr since I think decode errors happen much more rarely than
# read errors.

[walreceiver] LOG: replication terminated by primary server at 0/B014228 on timeline 1.
[startup] LOG: reached end of WAL at 0/B014228 on timeline 1 in archive during standby mode
[startup] DETAIL: empty record found at 0/B014228

This is the reason for the change.

Lastly, are we also planning to backport this patch?

This is apparently a behavioral change, not a bug fix, which I think we
regard as inappropriate for back-patching.

As a result, I made the following changes in version 11.

1. Changed the condition for the "end-of-WAL" message from
emode_for_corrupt_record to the EndOfWAL flag.

2. Corrected the wording of end-of-wal to end-of-WAL.

3. In the "reached end of WAL" message, report the LSN of the
beginning of the failed record instead of the beginning of the
last-succeeded record.

4. In the changed message in walreceiver.c, I swapped the LSN and the
timeline so that they are in the same order as in other similar messages.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v11-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From e07c1501cd0020f2a817dd9544c4aa5063e29685 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v11] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  93 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 271 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..618f33d342 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-WAL, emit the error
+			 * message right now. Otherwise the message if any is shown as a
+			 * part of the end-of-WAL message below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * If we haven't emit an error message, we have safely reached the
+			 * end-of-WAL.
+			 */
+			if (xlogreader->EndOfWAL)
+			{
+				char	   *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+								xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7294,7 +7334,7 @@ StartupXLOG(void)
 		{
 			ereport(LOG,
 					(errmsg("database system was not properly shut down; "
-							"automatic recovery in progress")));
+							"crash recovery in progress")));
 			if (recoveryTargetTLI > ControlFile->checkPointCopy.ThisTimeLineID)
 				ereport(LOG,
 						(errmsg("crash recovery starts in timeline %u "
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..03a8b42f15 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record found at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* the last attempt was EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
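
As a side note on the test above: the seek offset it computes with
hex($lastlsn) % $segsize is simply the position of the LSN inside its
segment file. A minimal C sketch of the same arithmetic follows. The
function names parse_lsn and offset_in_segment are hypothetical and not
part of the patch. Unlike the test, which only handles LSNs of the form
0/..., the sketch also takes the high half of the LSN into account.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN in the textual "%X/%X" form into a 64-bit value.
 * Returns 0 on success, -1 if the string does not match.
 * (Hypothetical helper, not a PostgreSQL function.)
 */
int
parse_lsn(const char *str, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	if (sscanf(str, "%X/%X", &hi, &lo) != 2)
		return -1;

	*lsn = ((uint64_t) hi << 32) | lo;
	return 0;
}

/*
 * Byte offset of an LSN inside its WAL segment file.  seg_size stands
 * for the wal_segment_size setting in bytes.
 */
uint64_t
offset_in_segment(uint64_t lsn, uint64_t seg_size)
{
	assert(seg_size > 0);

	return lsn % seg_size;
}
```

With a 16MB segment size, an LSN of 0/3000028 maps to offset 0x28 in
its segment file under these assumptions.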

#30

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#29)

Re: Make mesage at end-of-recovery less scary.

On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Hi, Ashutosh.

At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

Here are some of my review comments on the v11 patch:

Thank you for taking a look at this.
-                       (errmsg_internal("reached end of WAL in
pg_wal, entering archive recovery")));
+                       (errmsg_internal("reached end of WAL at %X/%X
on timeline %u in %s during crash recovery, entering archive
recovery",
+                                        LSN_FORMAT_ARGS(ErrRecPtr),
+                                        replayTLI,
+                                        xlogSourceNames[currentSource])));
Why crash recovery? Won't this message get printed even during PITR?
It is in the if-block with the following condition.

* If archive recovery was requested, but we were still doing
* crash recovery, switch to archive recovery and retry using the
* offline archive. We have now replayed all the valid WAL in
* pg_wal, so we are presumably now consistent.

...

if (!InArchiveRecovery && ArchiveRecoveryRequested)

This means archive-recovery is requested but not started yet. That is,
we've just finished crash recovery. The existing comment cited
together is mentioning that.

At the end of PITR (or archive recovery), the other code works.

This is quite understandable. The point here is that the message we
emit says we have just finished reading the WAL files in the pg_wal
directory during crash recovery and are now entering archive recovery,
when we are actually doing point-in-time recovery, which seems a bit
misleading.

/*
* If we haven't emit an error message, we have safely reached the
* end-of-WAL.
*/
if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
{
char *fmt;

if (StandbyMode)
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
else if (InArchiveRecovery)
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
else
fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");

The last among the above messages is chosen when archive-recovery is
not requested at all.

I just did a PITR and could see these messages in the logfile.

Yeah, the log lines show that the server starts with crash recovery
in order to run PITR.

2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time
recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG: database system was not
properly shut down; automatic recovery in progress

Well, I guess that "automatic recovery" is ambiguous. Would it make
sense if the second line read as follows instead?

+ 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress

Well, according to me the current message looks fine.

Lastly, are we also planning to backport this patch?

This is apparently a behavioral change, not a bug fix, which I think we
regard as not appropriate for back-patching.

As a result, I made the following changes in version 11.

1. Changed the condition for the "end-of-WAL" message from
emode_for_corrupt_record to the EndOfWAL flag.

2. Corrected the wording of end-of-wal to end-of-WAL.

3. In the "reached end of WAL" message, report the LSN of the
beginning of the failed record instead of the beginning of the
last-succeeded record.

4. In the changed message in walreceiver.c, I swapped LSN and timeline
so that they are in the same order as in other similar messages.

Thanks for sharing this information.

Here is one more comment:

+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

--
With Regards,
Ashutosh Sharma.

#31

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#30)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

This means archive-recovery is requested but not started yet. That is,
we've just finished crash recovery. The existing comment cited
together is mentioning that.

At the end of PITR (or archive recovery), the other code works.

This is quite understandable. The point here is that the message we
emit says we have just finished reading the WAL files in the pg_wal
directory during crash recovery and are now entering archive recovery,
when we are actually doing point-in-time recovery, which seems a bit
misleading.

Here are the messages.

2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time
recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG: database system was not
properly shut down; automatic recovery in progress
2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at
0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering
archive recovery

In the first place the last DEBUG1 is not on my part, but one of the
messages added by this patch says the same thing. Is your point that
archive recovery is a different thing from PITR? Regarding the
difference, I think PITR is a form of archive recovery.

That being said, after some thought on this, I changed my mind: we
don't need to say what operation was being performed at the
end-of-WAL. So in the attached patch the end-of-WAL message is not
accompanied by the kind of recovery.

LOG: reached end of WAL at 0/3000000 on timeline 1

I removed the archive-source part along with the operation mode,
because it makes the message untranslatable. It is now very simple but
seems sufficient.

While working on this, I noticed that we need to set EndOfWAL when
WaitForWALToBecomeAvailable returns with failure. That means the
file does not exist at all, so it is a kind of end-of-WAL. In that
sense the following existing comment in ReadRecord is a bit wrong.

* We only end up here without a message when XLogPageRead()
* failed - in that case we already logged something. In
* StandbyMode that only happens if we have been triggered, so we
* shouldn't loop anymore in that case.

Actually, there is a case where we get there without a message and
without having logged anything: when a segment file is not found and
we are not in standby mode.

Well, I guess that "automatic recovery" is ambiguous. Would it make
sense if the second line read as follows instead?

+ 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress

Well, according to me the current message looks fine.

Good to hear. (In the previous version I modified the message by accident..)

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

Ugh.. Fixed. (I wonder why Perl doesn't complain about this..)

In version 12 I made the following changes.

- Rewrote (partly reverted) a comment in ReadRecord

- Simplified the "reached end of WAL" message by removing recovery
mode and WAL source in ReadRecord.

- XLogPageRead sets EndOfWAL flag in the ENOENT case.

- Removed redundant declaration of the same variable in TAP script.
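
The resulting reporting rule in ReadRecord can be summarized with a
small sketch. This is a simplification under stated assumptions, not
code from the patch: the enum and the function classify_read_failure
are hypothetical names, and the sketch ignores both the switch from
crash recovery to archive recovery and the message-level downgrade done
by emode_for_corrupt_record.

```c
#include <stdbool.h>
#include <stddef.h>

/* What to report after a failed record read (hypothetical type) */
typedef enum ReadFailureReport
{
	REPORT_NOTHING,			/* nothing more to say at this point */
	REPORT_END_OF_WAL,		/* LOG "reached end of WAL", reader message as DETAIL */
	REPORT_READER_ERROR		/* emit the reader's message at the caller's level */
} ReadFailureReport;

/*
 * end_of_wal mirrors the EndOfWAL flag of the reader, fetching_ckpt
 * mirrors the ReadRecord argument, and errormsg is the message returned
 * by the reader (NULL if there is none).
 */
ReadFailureReport
classify_read_failure(bool end_of_wal, bool fetching_ckpt,
					  const char *errormsg)
{
	/* A real failure is reported immediately with the reader's message */
	if (!end_of_wal && errormsg != NULL)
		return REPORT_READER_ERROR;

	/* End-of-WAL is reported calmly, but not while fetching a checkpoint */
	if (end_of_wal && !fetching_ckpt)
		return REPORT_END_OF_WAL;

	return REPORT_NOTHING;
}
```

The point of the split is that a zero-length record or a missing
segment ends up in the second branch, while anything that looks like
corruption keeps its own message.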

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v12-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From e553164dbca709389d92b05cf8ae7a8b427e83a6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v12] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  92 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 108 +++++++++++++++++++++-
 6 files changed, 268 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..bf1d40e7cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * in that case we already logged something, or just met end-of-WAL
+			 * conditions. In StandbyMode that only happens if we have been
+			 * triggered, so we shouldn't loop anymore in that case. When
+			 * EndOfWAL is true, we don't emit that error if any immediately
+			 * and instead will show it as a part of a decent end-of-wal
+			 * message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4577,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4635,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7581,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7819,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7879,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -12434,12 +12478,14 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode);
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
-
+			xlogreader->EndOfWAL = true;
 			return -1;
 		}
 	}
@@ -13097,7 +13143,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..1d7476c309 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,99 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
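
The zero-page test that the patch above adds to ValidXLogRecordHeader
can be looked at in isolation. The following is a minimal sketch of
that check, not the patch's code: the function name
rest_of_page_is_zero is hypothetical, and page_size plays the role of
XLOG_BLCKSZ. One difference is deliberate: the sketch checks the bound
before reading each byte, whereas the loop in the patch tests *p == 0
before p < pe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if every byte from the start of the record to the end of
 * its WAL page is zero.  rec points at the record header inside a page
 * buffer and rec_ptr is the LSN of that record.
 */
bool
rest_of_page_is_zero(const char *rec, uint64_t rec_ptr, size_t page_size)
{
	size_t		remaining;
	size_t		i;

	/* the mask below only works for power-of-two page sizes */
	assert(page_size > 0 && (page_size & (page_size - 1)) == 0);

	/* number of bytes from the record start to the end of the page */
	remaining = page_size - (size_t) (rec_ptr & (page_size - 1));

	/* test the bound first, then read the byte */
	for (i = 0; i < remaining; i++)
	{
		if (rec[i] != 0)
			return false;
	}

	return true;
}
```

A caller would treat a true result as "no record after this point" and
a false result as an ordinary invalid-record-length case, which is the
distinction the patch draws.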

#32

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#31)

Re: Make mesage at end-of-recovery less scary.

Hi,

On Thu, Feb 10, 2022 at 11:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

This means archive-recovery is requested but not started yet. That is,
we've just finished crash recovery. The existing comment cited
together is mentioning that.

At the end of PITR (or archive recovery), the other code works.

This is quite understandable. The point here is that the message we
emit says we have just finished reading the WAL files in the pg_wal
directory during crash recovery and are now entering archive recovery,
when we are actually doing point-in-time recovery, which seems a bit
misleading.

Here are the messages.

2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time
recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG: database system was not
properly shut down; automatic recovery in progress
2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at
0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering
archive recovery

In the first place the last DEBUG1 is not on my part, but one of the
messages added by this patch says the same thing. Is your point that
archive recovery is a different thing from PITR? Regarding the
difference, I think PITR is a form of archive recovery.

No, I wasn't trying to compare archive recovery with PITR or vice
versa; I was trying to compare crash recovery with PITR. The message
you emit just before entering archive recovery says "reached
end-of-WAL on ... in pg_wal *during crash recovery*, entering archive
recovery". This message is static and can be emitted not only during
crash recovery, but also during PITR. I think we can remove the
"during crash recovery" part from this message, so that it reads
"reached the end of WAL at %X/%X on timeline %u in %s, entering
archive recovery". Also, I don't think we need the format specifier %s
here. It can be hard-coded as pg_wal, since in this case we can only
enter archive recovery after reading WAL from pg_wal, so the current
WAL source has to be pg_wal, doesn't it?

That being said, after some thought on this, I changed my mind: we
don't need to say what operation was being performed at the
end-of-WAL. So in the attached patch the end-of-WAL message is not
accompanied by the kind of recovery.

LOG: reached end of WAL at 0/3000000 on timeline 1

I removed the archive-source part along with the operation mode,
because it makes the message untranslatable. It is now very simple but
seems sufficient.

While working on this, I noticed that we need to set EndOfWAL when
WaitForWALToBecomeAvailable returns with failure. That means the
file does not exist at all, so it is a kind of end-of-WAL. In that
sense the following existing comment in ReadRecord is a bit wrong.

* We only end up here without a message when XLogPageRead()
* failed - in that case we already logged something. In
* StandbyMode that only happens if we have been triggered, so we
* shouldn't loop anymore in that case.

Actually, there is a case where we get there without a message and
without having logged anything: when a segment file is not found and
we are not in standby mode.

Well, I guess that "automatic recovery" is ambiguous. Would it make
sense if the second line read as follows instead?

+ 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress

Well, according to me the current message looks fine.

Good to hear. (In the previous version I modified the message by accident..)

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

Ugh.. Fixed. (I wonder why Perl doesn't complain about this..)

In version 12 I made the following changes.

- Rewrote (partly reverted) a comment in ReadRecord

- Simplified the "reached end of WAL" message by removing recovery
mode and WAL source in ReadRecord.

- XLogPageRead sets EndOfWAL flag in the ENOENT case.

- Removed redundant declaration of the same variable in TAP script.

Thanks for the changes. Please note that I am not able to apply the
latest patch on HEAD. Could you please rebase it on HEAD and share the
new version. Thank you.

--
With Regards,
Ashutosh Sharma.

#33

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#32)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

No, I wasn't trying to compare archive recovery with PITR or vice
versa; I was trying to compare crash recovery with PITR. The message
you emit just before entering archive recovery says "reached
end-of-WAL on ... in pg_wal *during crash recovery*, entering archive
recovery". This message is static and can be emitted not only during
crash recovery, but also during PITR. I think we can

No. It is emitted *only* after crash recovery before starting archive
recovery. Another message this patch adds can be emitted after PITR
or archive recovery.

not only during crash recovery, but also during PITR. I think we can
remove the "during crash recovery" part from this message, so "reached
the end of WAL at %X/%X on timeline %u in %s, entering archive

What makes you think it can be emitted after anything other than crash
recovery? (Please look at the code comment just above.)

recovery". Also I don't think we need format specifier %s here, it can
be hard-coded with pg_wal as in this case we can only enter archive
recovery after reading wal from pg_wal, so current WAL source has to
be pg_wal, isn't it?

You're right that it can't be other than pg_wal. It was changed just
for consistency with another message this patch adds, so it is a
matter of taste. I replaced it with "pg_wal" in this version.

Thanks for the changes. Please note that I am not able to apply the
latest patch on HEAD. Could you please rebase it on HEAD and share the
new version. Thank you.

A change to the TAP script caused this. The attached v13 is:

- Rebased.

- Replaced "%s" with "pg_wal" in the debug transition message from
crash recovery to archive recovery.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v13-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From 311e862e87dbdeb6348c6fc17063308342359c02 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v13] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 ++++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++
 6 files changed, 266 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..bb7026ac77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * in that case we already logged something, or just met end-of-WAL
+			 * conditions. In StandbyMode that only happens if we have been
+			 * triggered, so we shouldn't loop anymore in that case. When
+			 * EndOfWAL is true, we don't emit that error if any immediately
+			 * and instead will show it as a part of a decent end-of-wal
+			 * message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4577,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4610,12 +4634,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -7544,7 +7580,7 @@ StartupXLOG(void)
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		}
 
 		if (record != NULL)
@@ -7782,7 +7818,7 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7842,13 +7878,20 @@ StartupXLOG(void)
 
 			InRedo = false;
 		}
-		else
+		else if (xlogreader->EndOfWAL)
 		{
 			/* there are no WAL records following the checkpoint */
 			ereport(LOG,
 					(errmsg("redo is not required")));
 
 		}
+		else
+		{
+			/* broken record found */
+			ereport(WARNING,
+					(errmsg("redo is skipped"),
+					 errhint("This suggests WAL file corruption. You might need to check the database.")));
+		}
 
 		/*
 		 * This check is intentionally after the above log messages that
@@ -12434,12 +12477,14 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode);
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
-
+			xlogreader->EndOfWAL = true;
 			return -1;
 		}
 	}
@@ -13097,7 +13142,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#34

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#33)

Re: Make mesage at end-of-recovery less scary.

On Tue, Feb 15, 2022 at 7:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

No, I haven't tried to compare archive recovery to PITR or vice versa,
instead I was trying to compare crash recovery with PITR. The message
you're emitting says just before entering into the archive recovery is
- "reached end-of-WAL on ... in pg_wal *during crash recovery*,
entering archive recovery". This message is static and can be emitted
not only during crash recovery, but also during PITR. I think we can

No. It is emitted *only* after crash recovery before starting archive
recovery. Another message this patch adds can be emitted after PITR
or archive recovery.

not only during crash recovery, but also during PITR. I think we can
remove the "during crash recovery" part from this message, so "reached
the end of WAL at %X/%X on timeline %u in %s, entering archive

What makes you think it can be emitted after anything other than crash
recovery? (Please look at the code comment just above.)

Yep that's right. We won't be coming here in case of pitr.

recovery". Also I don't think we need format specifier %s here, it can
be hard-coded with pg_wal as in this case we can only enter archive
recovery after reading wal from pg_wal, so current WAL source has to
be pg_wal, isn't it?

You're right that it can't be other than pg_wal. It was changed just
in accordance with another message this patch adds, and it would be a
matter of taste. I replaced it with "pg_wal" in this version.

OK. I have verified the changes.

Thanks for the changes. Please note that I am not able to apply the
latest patch on HEAD. Could you please rebase it on HEAD and share the
new version. Thank you.

A change on TAP script hit this. The v13 attached is:

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

--
With Regards,
Ashutosh Sharma.

#35

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#34)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

Thanks! But the recent xlog.c refactoring conflicts with this patch.
And I found a silly bug while rebasing.

xlog.c:12463 / xlogrecovery.c:3168
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
..
{
+ Assert(!StandbyMode);
...
+ xlogreader->EndOfWAL = true;

Yeah, I forgot about promotion there. What I should have done is
set EndOfWAL according to StandbyMode.

+			Assert(!StandbyMode || CheckForStandbyTrigger());
...
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
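
The corrected rule can be sketched in isolation as follows. This is only an illustration under stated assumptions: ReaderSketch and mark_wait_failure are hypothetical names, and the two boolean parameters stand in for StandbyMode and the result of CheckForStandbyTrigger().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the reader state. */
typedef struct ReaderSketch
{
	bool		EndOfWAL;
} ReaderSketch;

/*
 * Record the outcome when waiting for more WAL has given up.
 * In standby mode the wait only gives up because promotion was
 * triggered, and a promotion exit is not end-of-WAL.
 */
static void
mark_wait_failure(ReaderSketch *reader, bool standby_mode,
				  bool promote_triggered)
{
	assert(!standby_mode || promote_triggered);
	reader->EndOfWAL = !standby_mode;
}
```

With this rule, a non-standby recovery that runs out of WAL is flagged as end-of-WAL, while a standby leaving the wait because of promotion is not.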

The rebased v14 is attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v14-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 5613ee80a4d2a9786f5ce8421dcbb560b63a13c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v14] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++++++-----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++
 6 files changed, 268 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
+	if (record->xl_tot_len == 0)
+	{
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		char	   *p;
+		char	   *pe;
+
+		/* scan from the beginning of the record to the end of block */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid record length */
+	}
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				(errmsg("redo is skipped"),
+				 errhint("This suggests WAL file corruption. You might need to check the database.")));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * in that case we already logged something, or just met end-of-WAL
+			 * conditions. In StandbyMode that only happens if we have been
+			 * triggered, so we shouldn't loop anymore in that case. When
+			 * EndOfWAL is true, we don't emit that error if any immediately
+			 * and instead will show it as a part of a decent end-of-wal
+			 * message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3129,12 +3172,16 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode || CheckForStandbyTrigger());
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
 			return -1;
 		}
 	}
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#36

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#35)

Re: Make mesage at end-of-recovery less scary.

On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

Thanks! But the recent xlog.c refactoring conflicts with this patch.
And I found a silly bug while rebasing.

Thanks! I'll take a look at the new changes.

--
With Regards,
Ashutosh Sharma.

#37

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#35)

Re: Make mesage at end-of-recovery less scary.

On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

Thanks! But the recent xlog.c refactoring conflicts with this patch.
And I found a silly bug while rebasing.
xlog.c:12463 / xlogrecovery.c:3168
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
..
{
+                       Assert(!StandbyMode);
...
+                       xlogreader->EndOfWAL = true;
Yeah, I forgot about promotion there.

Yes, we exit WaitForWALToBecomeAvailable() even in standby mode,
provided the user has requested promotion. So checking the
!StandbyMode condition alone was not enough.

So what I should have done is set EndOfWAL according to StandbyMode.

+                       Assert(!StandbyMode || CheckForStandbyTrigger());
...
+                       /* promotion exit is not end-of-WAL */
+                       xlogreader->EndOfWAL = !StandbyMode;

The changes look good. Thanks!
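
For readers following along, the rule agreed on above can be sketched in isolation. This is only a minimal stand-alone illustration, not the patch code: the function name end_of_wal_on_wait_failure and its two boolean parameters are hypothetical stand-ins for the StandbyMode flag and the result of CheckForStandbyTrigger() at the point where WaitForWALToBecomeAvailable() has returned false.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in (not PostgreSQL code) for the decision taken when
 * waiting for more WAL has given up.
 *
 * standby_mode        - assumed to mirror the StandbyMode flag
 * promotion_triggered - assumed to mirror CheckForStandbyTrigger()
 *
 * Returns the value the reader's EndOfWAL flag would be set to.
 */
static bool
end_of_wal_on_wait_failure(bool standby_mode, bool promotion_triggered)
{
	/* In standby mode the wait only gives up when promotion was requested. */
	assert(!standby_mode || promotion_triggered);

	/* A promotion exit is not end-of-WAL; any other exit is. */
	return !standby_mode;
}
```

With this shape, a plain crash or archive recovery that runs out of WAL reports end-of-WAL, while a standby leaving the wait loop because of promotion does not.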

--
With Regards,
Ashutosh Sharma.

#38

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#37)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

The changes look good. Thanks!

Thanks!

A recent core change altered WAL insertion speed during the TAP
test and revealed one forgotten case of EndOfWAL. When a record
header spills over into the next page, XLogReadRecord performs its own
check, separate from ValidXLogRecordHeader.

    /*
     * If the whole record header is on this page, validate it immediately.
     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
     * rest of the header after reading it from the next page. The xl_tot_len
     * check is necessary here to ensure that we enter the "Need to reassemble
     * record" code path below; otherwise we might fail to apply
     * ValidXLogRecordHeader at all.
     */
    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
    {
        ...
    }
    else
    {
        /* XXX: more validation should be done here */
        if (total_len < SizeOfXLogRecord)
        {

I could simply copy part of ValidXLogRecordHeader there, but that
would result in a rather large amount of duplicated code. I could have
ValidXLogRecordHeader handle the partial-header case, but that seems
complex to me.

So in this version I split the xl_tot_len part of
ValidXLogRecordHeader into ValidXLogRecordLength.
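
The idea behind the split-out length check can be sketched on its own. The following is a simplified illustration under stated assumptions, not the code in the attached patch: check_record_length, LenCheck, SKETCH_BLCKSZ and SKETCH_MIN_RECORD_LEN are hypothetical names, the page size is assumed to be 8192 bytes, and the minimum record length of 24 is taken from the "wanted 24" in the log messages quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ			8192	/* assumed WAL page size */
#define SKETCH_MIN_RECORD_LEN	24		/* assumed, from "wanted 24" */

typedef enum
{
	LEN_OK,						/* length looks plausible */
	LEN_END_OF_WAL,				/* zero length and rest of page is zeroed */
	LEN_INVALID					/* too short, but not a clean end of WAL */
} LenCheck;

/*
 * Classify the total-length field of the record starting at rec_off within
 * a single page.  Only the length field itself needs to be on this page.
 */
static LenCheck
check_record_length(const unsigned char *page, size_t rec_off)
{
	uint32_t	tot_len;
	size_t		i;

	assert(rec_off + sizeof(tot_len) <= SKETCH_BLCKSZ);
	memcpy(&tot_len, page + rec_off, sizeof(tot_len));

	if (tot_len == 0)
	{
		/* Scan to the end of the page; test the bound before reading. */
		for (i = rec_off; i < SKETCH_BLCKSZ && page[i] == 0; i++)
			;

		if (i == SKETCH_BLCKSZ)
			return LEN_END_OF_WAL;
	}

	if (tot_len < SKETCH_MIN_RECORD_LEN)
		return LEN_INVALID;

	return LEN_OK;
}
```

Because the function touches nothing beyond the length field and the remainder of the current page, the same check can serve both the full-header path and the partial-header path.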

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v15-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 01cce076d2b3ad536398cc2b716ef64ed9b2c409 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v15] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 125 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++
 6 files changed, 297 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..ba1c1ece87 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -380,12 +384,11 @@ restart:
 	 * whole header.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Otherwise do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
@@ -399,18 +402,13 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -588,6 +586,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -719,6 +726,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -730,14 +791,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (record->xl_rmid > RM_MAX_ID)
 	{
 		report_invalid_record(state,
@@ -836,6 +892,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +1002,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				(errmsg("redo is skipped"),
+				 errhint("This suggests WAL file corruption. You might need to check the database.")));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * in that case we already logged something, or just met end-of-WAL
+			 * conditions. In StandbyMode that only happens if we have been
+			 * triggered, so we shouldn't loop anymore in that case. When
+			 * EndOfWAL is true, we don't emit that error if any immediately
+			 * and instead will show it as a part of a decent end-of-wal
+			 * message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3129,12 +3172,16 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode || CheckForStandbyTrigger());
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
 			return -1;
 		}
 	}
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..215abe95dc 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1173,9 +1173,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0

#39

Ashutosh Sharma

ashu.coek88@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#38)

Re: Make mesage at end-of-recovery less scary.

On Wed, Mar 2, 2022 at 7:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

The changes look good. Thanks!

Thanks!

A recent core change altered WAL insertion speed during the TAP
test and revealed one forgotten case of EndOfWAL. When a record
header spills over into the next page, XLogReadRecord performs its own
check, separate from ValidXLogRecordHeader.

The new changes made in the patch look good. Thanks to the recent
changes to speed up WAL insertion, we were able to catch this bug.

One small comment:

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;

Do you think we need to change the position of the comment written
for the above code, which says:

/*
* Read the record length.
*
...
...

--
With Regards,
Ashutosh Sharma.

#40

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Ashutosh Sharma (#39)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Thu, 3 Mar 2022 15:39:44 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in

The new changes made in the patch look good. Thanks to the recent
changes to speed up WAL insertion, we were able to catch this bug.

Thanks for the quick check.

One small comment:

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;

Do you think we need to change the position of the comment written
for the above code, which says:

Yeah, I didn't do that since it is about header verification. But as
you pointed out, the result still doesn't look perfect.

On second thought, the two comments seem to repeat the same thing, so I
merged them. In version 16 it looks like this.

    /*
     * Validate the record header.
     *
     * Even though we use an XLogRecord pointer here, the whole record header
     * might not fit on this page. If the whole record header is on this page,
     * validate it immediately. Even otherwise xl_tot_len must be on this page
     * (it is the first field of MAXALIGNed records), but we still cannot
     * access any further fields until we've verified that we got the whole
     * header, so do just a basic sanity check on record length, and validate
     * the rest of the header after reading it from the next page. The length
     * check is necessary here to ensure that we enter the "Need to reassemble
     * record" code path below; otherwise we might fail to apply
     * ValidXLogRecordHeader at all.
     */
    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);

    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
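
The condition that closes the snippet above decides which of the two validation paths is taken, and it may be easier to see with concrete numbers. The sketch below is a stand-alone illustration only: header_fits_on_page, SKETCH_BLCKSZ and SKETCH_HEADER_SIZE are hypothetical names, with the page size assumed to be 8192 bytes and the header size assumed to be 24 bytes (the "wanted 24" from the log messages in this thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLCKSZ		8192	/* assumed WAL page size */
#define SKETCH_HEADER_SIZE	24		/* assumed record header size */

/*
 * Return true if a record header starting at rec_ptr lies entirely within
 * its page, so that it could be validated in full right away.  Otherwise
 * only the length field is safe to inspect before reading the next page.
 */
static bool
header_fits_on_page(uint64_t rec_ptr)
{
	uint32_t	rec_off = (uint32_t) (rec_ptr % SKETCH_BLCKSZ);

	return rec_off <= SKETCH_BLCKSZ - SKETCH_HEADER_SIZE;
}
```

Under these assumptions, a header starting at page offset 8168 still fits exactly, while one starting at offset 8176 has only 16 bytes left on the page and takes the length-only check first.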

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v16-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 00d848df6bb8b9966dfbd39c98a388fda42a3e3c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v16] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 144 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 305 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..bd0f211a23 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	state->EndOfWAL = false;
 
 	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
@@ -371,25 +375,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
@@ -399,18 +399,13 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -588,6 +583,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	/*
@@ -719,6 +723,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -730,14 +788,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (record->xl_rmid > RM_MAX_ID)
 	{
 		report_invalid_record(state,
@@ -836,6 +889,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -921,6 +999,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				(errmsg("redo is skipped"),
+				 errhint("This suggests WAL file corruption. You might need to check the database.")));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * (in which case we already logged something) or when we just met
+			 * end-of-WAL conditions. In StandbyMode that only happens if we
+			 * have been triggered, so we shouldn't loop anymore in that case.
+			 * When EndOfWAL is true, we don't emit the error (if any)
+			 * immediately; instead we show it later as part of a proper
+			 * end-of-WAL message.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3129,12 +3172,16 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode || CheckForStandbyTrigger());
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
 			return -1;
 		}
 	}
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..215abe95dc 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1173,9 +1173,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
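
For readers skimming the patch above, here is a minimal standalone sketch of the length check it adds to xlogreader.c: a zero xl_tot_len followed by an all-zero remainder of the page is treated as end of WAL, while any other too-short length is treated as an invalid record. This is not code from the patch. The names SKETCH_BLCKSZ, SKETCH_MIN_RECLEN, RecLenStatus, rest_of_page_is_zero and classify_record_length are hypothetical, the page size of 8192 is an assumption, and 24 is taken from the "wanted 24" log line quoted in the thread. The scan checks the bound before dereferencing, so it never reads past the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192		/* assumed stand-in for XLOG_BLCKSZ */
#define SKETCH_MIN_RECLEN 24	/* stand-in for SizeOfXLogRecord ("wanted 24") */

/* Hypothetical result type; the patch itself uses a bool plus an EndOfWAL flag. */
typedef enum
{
	REC_LEN_OK,
	REC_END_OF_WAL,
	REC_LEN_INVALID
} RecLenStatus;

/* Return true if every byte from page[off] to the end of the page is zero. */
static bool
rest_of_page_is_zero(const unsigned char *page, size_t off)
{
	assert(off <= SKETCH_BLCKSZ);

	/* The loop condition is tested before each access to page[i]. */
	for (size_t i = off; i < SKETCH_BLCKSZ; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/* Classify the record that starts at rec_off within a single page. */
static RecLenStatus
classify_record_length(const unsigned char *page, size_t rec_off)
{
	uint32_t	tot_len;

	/* The length field is the first field of the record header. */
	assert(rec_off + sizeof(tot_len) <= SKETCH_BLCKSZ);
	memcpy(&tot_len, page + rec_off, sizeof(tot_len));

	/* Zero length and nothing but zeroes afterwards: assume end of WAL. */
	if (tot_len == 0 && rest_of_page_is_zero(page, rec_off))
		return REC_END_OF_WAL;

	/* Any other length shorter than a record header is a broken record. */
	if (tot_len < SKETCH_MIN_RECLEN)
		return REC_LEN_INVALID;

	return REC_LEN_OK;
}
```

In this sketch a caller would map REC_END_OF_WAL to the LOG-level "reached end of WAL" message and REC_LEN_INVALID to the WARNING-level report, which mirrors how the patch uses the EndOfWAL flag in ReadRecord().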

#41

Andres Freund

andres@anarazel.de

almost 4 years ago

In reply to: Kyotaro Horiguchi (#40)

Re: Make mesage at end-of-recovery less scary.

Hi,

On 2022-03-04 09:43:59 +0900, Kyotaro Horiguchi wrote:

On second thought, the two seem to repeat the same thing, so I
merged the two comments together. In version 16 it looks like
this.

Patch currently fails to apply, needs a rebase:
http://cfbot.cputube.org/patch_37_2490.log

Greetings,

Andres Freund

#42

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Andres Freund (#41)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Mon, 21 Mar 2022 17:01:19 -0700, Andres Freund <andres@anarazel.de> wrote in

Patch currently fails to apply, needs a rebase:
http://cfbot.cputube.org/patch_37_2490.log

Thanks for letting me know.

Rebased to the current HEAD.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v17-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From a7c9f36e631eaba5078398598dae5d459e79add9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v17] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like
"invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Replace this message with the less scary "reached end of WAL". At the same
time, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 145 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 306 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..0942265408 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -46,6 +46,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -633,25 +637,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -661,18 +661,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -904,6 +900,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	if (decoded && decoded->oversized)
@@ -1083,6 +1088,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1094,14 +1153,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (record->xl_rmid > RM_MAX_ID)
 	{
 		report_invalid_record(state,
@@ -1200,6 +1254,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1285,6 +1364,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..98382d66a4 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				(errmsg("redo is skipped"),
+				 errhint("This suggests WAL file corruption. You might need to check the database.")));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * (in which case we already logged something) or when we just met
+			 * end-of-WAL conditions. In StandbyMode that only happens if we
+			 * have been triggered, so we shouldn't loop anymore in that case.
+			 * When EndOfWAL is true, we don't emit the error (if any)
+			 * immediately; instead we show it later as part of a proper
+			 * end-of-WAL message.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3129,12 +3172,16 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode || CheckForStandbyTrigger());
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
 			return -1;
 		}
 	}
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index fc081adfb8..9bebca8154 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1174,9 +1174,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f4388cc9be..21a8f9552c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -201,6 +201,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
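
For readers following the TAP test in the patch above: it finds where to corrupt the segment by taking the record's LSN modulo the WAL segment size. Below is a minimal C sketch of that arithmetic. The helper names parse_lsn and segment_offset are hypothetical and not part of PostgreSQL. The 16 MB segment size in the usage note is an assumption, not something the test hard-codes (the test reads wal_segment_size from pg_settings).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN printed in the usual "%X/%X" form (high 32 bits, a slash,
 * low 32 bits) into a 64-bit value.  Hypothetical helper, for illustration.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	if (sscanf(text, "%X/%X", &hi, &lo) != 2)
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Byte offset of an LSN inside its WAL segment.  Assumes the segment size
 * is a power of two, so masking gives the same result as the modulo the
 * TAP test uses.
 */
static uint64_t
segment_offset(uint64_t lsn, uint64_t segsize)
{
	assert(segsize != 0 && (segsize & (segsize - 1)) == 0);
	return lsn & (segsize - 1);
}
```

With an assumed 16 MB segment size, the LSN 0/1551048 from the first message of this thread maps to offset 0x551048 within its segment. Note that the Perl test only converts the low half of the LSN, which is sufficient there because the high half is 0.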

#43

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 4 years ago

In reply to: Kyotaro Horiguchi (#42)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

me> Rebased to the current HEAD.

b64c3bd62e (removal of unused "use Config") conflicted on a TAP
script.

Rebased.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v18-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From e1492816913efcb4fc25ee6a3bafd27a6c5f3f9a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v18] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 145 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 306 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..0942265408 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -46,6 +46,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -633,25 +637,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -661,18 +661,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -904,6 +900,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	if (decoded && decoded->oversized)
@@ -1083,6 +1088,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (*p == 0 && p < pe)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1094,14 +1153,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (record->xl_rmid > RM_MAX_ID)
 	{
 		report_invalid_record(state,
@@ -1200,6 +1254,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1285,6 +1364,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8b22c4e634..de8be3b834 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogreader, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				(errmsg("redo is skipped"),
+				 errhint("This suggests WAL file corruption. You might need to check the database.")));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -2949,6 +2956,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -2964,6 +2972,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2973,13 +2993,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * We only end up here without a message when XLogPageRead() failed
+			 * in that case we already logged something, or just met end-of-WAL
+			 * conditions. In StandbyMode that only happens if we have been
+			 * triggered, so we shouldn't loop anymore in that case. When
+			 * EndOfWAL is true, we don't emit that error if any immediately
+			 * and instead will show it as a part of a decent end-of-wal
+			 * message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -3010,11 +3033,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3027,11 +3053,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										 LSN_FORMAT_ARGS(ErrRecPtr),
+										 replayTLI)));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3052,12 +3083,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						(errmsg("reached end of WAL at %X/%X on timeline %u",
+								LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3139,12 +3182,16 @@ retry:
 										 private->replayTLI,
 										 xlogreader->EndRecPtr))
 		{
+			Assert(!StandbyMode || CheckForStandbyTrigger());
+
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
+			/* promotion exit is not end-of-WAL */
+			xlogreader->EndOfWAL = !StandbyMode;
 			return -1;
 		}
 	}
@@ -3777,7 +3824,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3c9411e221..2f9ef9bf31 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									(errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+											LSN_FORMAT_ARGS(LogstreamResult.Write),
+											startpointTLI)));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 4cb40d068a..2a78e954de 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1310,9 +1310,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		fatal_error("error in WAL record at %X/%X: %s",
-					LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-					errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			fatal_error("error in WAL record at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f4388cc9be..21a8f9552c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -201,6 +201,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046..f8b4a8417c 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.27.0
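
The end-of-WAL heuristic that ValidXLogRecordLength adds in the patch above can be sketched outside the server roughly as follows. This is an illustration under stated assumptions, not the patch's code: SKETCH_BLCKSZ stands in for XLOG_BLCKSZ, the helper names and the RecLenVerdict enum are hypothetical, and the minimum header length is passed in as a parameter (24 in the log examples in this thread). The sketch tests the loop bound before dereferencing, so it never reads past the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* stand-in for XLOG_BLCKSZ (assumed value) */

typedef enum
{
	REC_OK,						/* length looks plausible */
	REC_END_OF_WAL,				/* zero length and the rest of the page is zero */
	REC_INVALID_LENGTH			/* too short to hold a record header */
} RecLenVerdict;

/* Return true when every byte from offset to the end of the page is zero. */
static bool
tail_is_all_zero(const unsigned char *page, size_t offset)
{
	assert(offset <= SKETCH_BLCKSZ);

	for (size_t i = offset; i < SKETCH_BLCKSZ; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/*
 * Classify a record length found at the given offset of a page.  A zero
 * length followed only by zeroes is treated as end of WAL.  Any other
 * length below min_len is reported as invalid.
 */
static RecLenVerdict
classify_record_length(const unsigned char *page, size_t offset,
					   uint32_t tot_len, uint32_t min_len)
{
	if (tot_len == 0 && tail_is_all_zero(page, offset))
		return REC_END_OF_WAL;

	if (tot_len < min_len)
		return REC_INVALID_LENGTH;

	return REC_OK;
}
```

A zero length with non-zero bytes later on the same page falls through to the invalid-length case. That matches the scenario the TAP test provokes by zeroing only the first four bytes of the last record.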

#44

Jacob Champion

jchampion@timescale.com

over 3 years ago

In reply to: Kyotaro Horiguchi (#43)

Re: Make mesage at end-of-recovery less scary.

On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

> Rebased.

Unfortunately this will need another rebase over latest.

[CFM hat] Looking through the history here, this has been bumped to
Ready for Committer a few times and then bumped back to Needs Review
after a required rebase. What's the best way for us to provide support
for contributors who get stuck in this loop? Maybe we can be more
aggressive about automated notifications when a RfC patch goes red in
the cfbot?

Thanks,
--Jacob

#45

Michael Paquier

michael@paquier.xyz

over 3 years ago

In reply to: Jacob Champion (#44)

Re: Make mesage at end-of-recovery less scary.

On Wed, Jul 06, 2022 at 11:05:51AM -0700, Jacob Champion wrote:

> [CFM hat] Looking through the history here, this has been bumped to
> Ready for Committer a few times and then bumped back to Needs Review
> after a required rebase. What's the best way for us to provide support
> for contributors who get stuck in this loop? Maybe we can be more
> aggressive about automated notifications when a RfC patch goes red in
> the cfbot?

Having a better integration between the CF bot and the CF app would be
great, IMO. People tend to easily forget about what they send in my
experience, even if they manage a small pool of patches or a larger
one.
--
Michael

#46

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Jacob Champion (#44)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Wed, 6 Jul 2022 11:05:51 -0700, Jacob Champion <jchampion@timescale.com> wrote in

> On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>
> > Rebased.
>
> Unfortunately this will need another rebase over latest.

Thanks! Done.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v19-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From ff8a069ca587c55e06f0700492edc8fc16d15138 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 7 Jul 2022 11:51:45 +0900
Subject: [PATCH v19] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 145 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  95 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 308 insertions(+), 59 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3dc4b7797..a86ab2b02b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -554,6 +557,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -635,25 +639,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -663,18 +663,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -906,6 +902,15 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/*
+		 * If the message is not set yet, that means we failed to load the
+		 * page for the record.  Otherwise do not hide the existing message.
+		 */
+		if (state->errormsg_buf[0] == '\0')
+			report_invalid_record(state,
+								  "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	if (decoded && decoded->oversized)
@@ -1085,6 +1090,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page from this record on is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1096,14 +1155,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1202,6 +1256,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1287,6 +1366,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..baae4e84cf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1622,7 +1622,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1731,7 +1731,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1785,11 +1785,19 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+		
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -2969,6 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -2984,6 +2993,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error it points to the end of the failed
+				 * record, but for now we have no means of telling whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -2994,13 +3015,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3030,11 +3053,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3047,11 +3073,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3072,12 +3103,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3172,11 +3215,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3830,7 +3878,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3d37c1fe62..73f7641c4f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 6528113628..bd09a62a9d 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 5395f155aa..b2dae53557 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046..bde16b7cfa 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	select(undef, undef, undef, 0.5);	# builtin sleep() truncates 0.5 to 0
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	select(undef, undef, undef, 0.5);	# builtin sleep() truncates 0.5 to 0
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1
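
Editorial sketch (not part of the patch): the core of ValidXLogRecordLength
above is the decision between "end of WAL" and "invalid record length". The
standalone C fragment below illustrates that decision outside the server. The
names sketch_tail_is_zero, sketch_classify_length, SKETCH_PAGE_SIZE and
SKETCH_MIN_RECORD_LEN are hypothetical stand-ins for illustration only; the
two constants assume a page size of 8192 and the 24-byte minimum seen in the
"wanted 24" log lines. The sketch tests the bound before reading each byte so
that it never looks past the page buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for XLOG_BLCKSZ and SizeOfXLogRecord. */
#define SKETCH_PAGE_SIZE		8192
#define SKETCH_MIN_RECORD_LEN	24

typedef enum
{
	SKETCH_LEN_OK,				/* length looks plausible, keep validating */
	SKETCH_END_OF_WAL,			/* zero length and zeroed rest of page */
	SKETCH_INVALID_LEN			/* too short, and not a clean end of WAL */
} SketchLenResult;

/* Return true if every byte from rec_off to the end of the page is zero. */
static bool
sketch_tail_is_zero(const unsigned char *page, size_t rec_off)
{
	assert(rec_off <= SKETCH_PAGE_SIZE);

	/* The bound is checked before each read, so page[i] is always valid. */
	for (size_t i = rec_off; i < SKETCH_PAGE_SIZE; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/* Classify the 32-bit total-length field found at rec_off within the page. */
static SketchLenResult
sketch_classify_length(const unsigned char *page, size_t rec_off)
{
	uint32_t	tot_len;

	assert(rec_off + sizeof(tot_len) <= SKETCH_PAGE_SIZE);
	memcpy(&tot_len, page + rec_off, sizeof(tot_len));

	/* A zero length followed only by zeroes is treated as end of WAL. */
	if (tot_len == 0 && sketch_tail_is_zero(page, rec_off))
		return SKETCH_END_OF_WAL;

	/* Anything else shorter than a record header is reported as broken. */
	if (tot_len < SKETCH_MIN_RECORD_LEN)
		return SKETCH_INVALID_LEN;

	return SKETCH_LEN_OK;
}
```

A caller would pass the page buffer and the record's offset within that page.
A fully zeroed tail yields SKETCH_END_OF_WAL, while a zero length followed by
leftover non-zero bytes (the corrupted-record case exercised by the TAP test
above) yields SKETCH_INVALID_LEN.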

#47

Justin Pryzby

pryzby@telsasoft.com

over 3 years ago

In reply to: Kyotaro Horiguchi (#46)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

@cfbot: rebased over adb466150, which did the same thing as one of the
hunks in xlogreader.c.

Attachments:

0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-diff; charset=us-asciiDownload

From c4069bb7181b68d742d2025567f859e69d24f513 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 7 Jul 2022 11:51:45 +0900
Subject: [PATCH] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 134 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  95 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 +++++++++++++++++
 6 files changed, 298 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 050d2f424e4..9b8f29d0ad0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1106,16 +1102,47 @@ XLogReaderInvalReadState(XLogReaderState *state)
 }
 
 /*
- * Validate an XLOG record header.
+ * Validate record length of an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
- * XLogReadRecord.  It's not intended for use from anywhere else.
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
  */
 static bool
-ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-					  XLogRecPtr PrevRecPtr, XLogRecord *record,
-					  bool randAccess)
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
 {
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page from this record on is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -1124,6 +1151,24 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
+
+	return true;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord.  It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecPtr PrevRecPtr, XLogRecord *record,
+					  bool randAccess)
+{
+	if (!ValidXLogRecordLength(state, RecPtr, record))
+		return false;
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1222,6 +1267,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1307,6 +1377,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b41e6826643..caa3b5e5b31 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1626,7 +1626,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1735,7 +1735,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1789,11 +1789,19 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+		
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3024,6 +3032,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3045,6 +3054,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error it points to the end of the failed
+				 * record, but for now we have no means of telling whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3055,13 +3076,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3091,11 +3114,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3108,11 +3134,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3133,12 +3164,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3233,11 +3276,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3898,7 +3946,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6ef0ace2c4..c6d7be66885 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca58..26a4125b301 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 6afec33d418..264afb6a78e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046d..bde16b7cfa7 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	select(undef, undef, undef, 0.5);	# built-in sleep() truncates 0.5 to 0
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	select(undef, undef, undef, 0.5);	# built-in sleep() truncates 0.5 to 0
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.25.1
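
For anyone reading the emode_for_corrupt_record() hunk in isolation: the function demotes a repeated complaint about the same LSN to DEBUG1, and the patch widens the condition from "emode == LOG" to "emode <= WARNING" so that the new WARNING-level reports are deduplicated as well. Below is a minimal standalone sketch of that behaviour. The SK_* levels, the sketch_lsn type and the from_pg_wal flag are stand-ins invented for illustration (only the relative order of the levels matters); they are not PostgreSQL definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for elog levels; only their ordering matters. */
enum sketch_elevel
{
	SK_DEBUG1 = 1,
	SK_LOG = 2,
	SK_WARNING = 3,
	SK_PANIC = 4
};

/* Hypothetical stand-in for XLogRecPtr. */
typedef uint64_t sketch_lsn;

/*
 * Return the level at which a corrupt-record complaint should be emitted.
 * A second complaint about the same LSN is demoted to SK_DEBUG1, but only
 * when reading from pg_wal and only for levels up to SK_WARNING.
 */
static int
sketch_emode_for_corrupt_record(int emode, sketch_lsn rec_ptr, bool from_pg_wal)
{
	static sketch_lsn last_complaint = 0;

	assert(emode >= SK_DEBUG1 && emode <= SK_PANIC);

	if (from_pg_wal && emode <= SK_WARNING)
	{
		if (rec_ptr == last_complaint)
			emode = SK_DEBUG1;		/* already complained about this LSN */
		else
			last_complaint = rec_ptr;
	}
	return emode;
}
```

Calling it twice with the same LSN at SK_WARNING yields SK_WARNING and then SK_DEBUG1, while SK_PANIC passes through untouched. That is the reason for comparing with "<=" rather than testing for one specific level.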

#48

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 3 years ago

In reply to: Justin Pryzby (#47)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Fri, 16 Sep 2022 23:21:50 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in

@cfbot: rebased over adb466150, which did the same thing as one of the
hunks in xlogreader.c.

Oops. Thanks! This then hit a further conflict (the parameter name
harmonization), so I rebased it again. I also removed the extra blank
line you pointed out.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v21-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From b70a33bd941e9845106bea502db30d32e0138251 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 7 Jul 2022 11:51:45 +0900
Subject: [PATCH v21] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length", which suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
turn, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 136 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 +++++++++++++++++
 6 files changed, 298 insertions(+), 59 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4d6c34e0fc..b03eeb1487 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1105,6 +1101,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1116,14 +1166,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1222,6 +1267,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1307,6 +1377,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recaddr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b41e682664..56f29e73fe 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1626,7 +1626,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1735,7 +1735,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1789,11 +1789,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3024,6 +3031,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3045,6 +3053,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it points to the end of the
+				 * failed record, but for now we have no means of knowing
+				 * whether EndRecPtr points to the beginning or the end of
+				 * the failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3055,13 +3075,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3091,11 +3113,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3108,11 +3133,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3133,12 +3163,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3233,11 +3275,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3898,7 +3945,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6ef0ace2c..c6d7be6688 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..26a4125b30 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 6dcde2523a..0818cb7ef0 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046..bde16b7cfa 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	select(undef, undef, undef, 0.5);	# built-in sleep() truncates 0.5 to 0
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	select(undef, undef, undef, 0.5);	# built-in sleep() truncates 0.5 to 0
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1
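
To make the end-of-WAL heuristic in ValidXLogRecordLength() easier to follow: a record whose xl_tot_len is zero is treated as end-of-WAL only if the rest of the page from that point on is zero-filled too. The following is a rough standalone sketch of that scan, not the patch code itself. SKETCH_BLCKSZ stands in for XLOG_BLCKSZ (8192 is assumed here as the usual default), and page_tail_is_zero is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for XLOG_BLCKSZ; 8192 is assumed as the usual default. */
#define SKETCH_BLCKSZ 8192

/*
 * Return true if every byte from 'offset' to the end of the page is zero.
 * The bounds test comes before the dereference, so the byte just past the
 * end of the page is never read.
 */
static bool
page_tail_is_zero(const char *page, size_t offset)
{
	const char *p = page + offset;
	const char *pe = page + SKETCH_BLCKSZ;

	assert(offset <= SKETCH_BLCKSZ);

	while (p < pe && *p == 0)
		p++;

	return p == pe;
}
```

On a page that is all zeroes except for one byte at offset 200, a scan starting at offset 100 reports false and a scan starting at offset 201 reports true. As the patch comment says, only the current page is checked, because later pages in a recycled segment are not expected to be zeroed.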

#49

Justin Pryzby

pryzby@telsasoft.com

about 3 years ago

In reply to: Kyotaro Horiguchi (#48)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

rebased

Attachments:

0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-diff; charset=us-asciiDownload

From 67ce65038ae6a7d5b023b7472df9f9ca9835d5f5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 7 Jul 2022 11:51:45 +0900
Subject: [PATCH] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length", which suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
turn, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 135 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  95 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 +++++++++++++++++
 6 files changed, 299 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 93f667b2544..f891a629443 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1106,16 +1102,47 @@ XLogReaderInvalReadState(XLogReaderState *state)
 }
 
 /*
- * Validate an XLOG record header.
+ * Validate record length of an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
- * XLogReadRecord.  It's not intended for use from anywhere else.
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
  */
 static bool
-ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-					  XLogRecPtr PrevRecPtr, XLogRecord *record,
-					  bool randAccess)
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
 {
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL.  Make sure that the
+		 * rest of the page after this point is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -1124,6 +1151,24 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
+
+	return true;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord.  It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecPtr PrevRecPtr, XLogRecord *record,
+					  bool randAccess)
+{
+	if (!ValidXLogRecordLength(state, RecPtr, record))
+		return false;
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1219,6 +1264,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1304,6 +1375,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea6..0034f65af02 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1626,7 +1626,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1735,7 +1735,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1789,11 +1789,19 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+		
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3024,6 +3032,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3045,6 +3054,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed.  After a
+				 * decoding error it points to the end of the failed record
+				 * instead, and for now we have no means of telling whether
+				 * EndRecPtr points to the beginning or to the end of the failed
+				 * record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3055,13 +3076,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3091,11 +3114,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3108,11 +3134,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3133,12 +3164,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3233,11 +3276,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3898,7 +3946,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6cbb67c92a3..7ef19e435ce 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca58..26a4125b301 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316ae..70d3b25edaf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046d..bde16b7cfa7 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	select(undef, undef, undef, 0.5);    # built-in sleep() truncates 0.5 to 0
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	select(undef, undef, undef, 0.5);    # built-in sleep() truncates 0.5 to 0
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.25.1
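
An editorial aside on the patch above: the zero-tail test that ValidXLogRecordLength performs can be tried outside the server. The following is a minimal sketch under stated assumptions, not code from the patch: SKETCH_BLCKSZ and tail_is_all_zero are hypothetical names, and the page is modeled as a plain byte buffer rather than the reader's readBuf. It only illustrates the idea that a zero xl_tot_len followed by zeroes up to the page boundary is read as end of WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for XLOG_BLCKSZ; the real value is fixed at build time. */
#define SKETCH_BLCKSZ 8192

/*
 * Return true if every byte from 'offset' to the end of the page is zero.
 * The bounds test is evaluated before the dereference, so the loop never
 * reads past the end of the buffer.
 */
static bool
tail_is_all_zero(const unsigned char *page, size_t offset)
{
	const unsigned char *p;
	const unsigned char *pe;

	assert(page != NULL);
	assert(offset <= SKETCH_BLCKSZ);

	p = page + offset;
	pe = page + SKETCH_BLCKSZ;

	while (p < pe && *p == 0)
		p++;

	return p == pe;
}
```

A caller would pass the page buffer and the record's offset within it (RecPtr % XLOG_BLCKSZ in the patch). A true result corresponds to the case where the patch sets EndOfWAL; a false result falls through to the ordinary "invalid record length" report.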

#50

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 3 years ago

In reply to: Justin Pryzby (#49)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Just rebased.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v22-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 1efe0601596807c25769370f38884c7027a00839 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 15 Nov 2022 13:41:46 +0900
Subject: [PATCH v22] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead. In exchange, raise the error level for other kinds of WAL
failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 137 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 +++++++++++++++++
 6 files changed, 299 insertions(+), 59 deletions(-)
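
Note for readers (not part of the patch): on the page-header side, the patch treats two situations as end of WAL: a page that is entirely zero, and a page whose xlp_pageaddr is lower than the expected address, which is assumed to be an unused page in a recycled segment. Below is a minimal sketch of that decision, assuming hypothetical names (SketchPageHeader, SKETCH_PAGE_MAGIC, classify_page) and a made-up magic value. It covers only these checks; the real XLogReaderValidatePageHeader performs further validation in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the real XLOG_PAGE_MAGIC differs by version. */
#define SKETCH_PAGE_MAGIC 0xABCD

typedef enum
{
	PAGE_OK,
	PAGE_END_OF_WAL,
	PAGE_CORRUPT
} PageVerdict;

/* Only the two header fields this sketch needs. */
typedef struct
{
	uint16_t	magic;
	uint64_t	pageaddr;
} SketchPageHeader;

/*
 * Classify a page header the way the patch's new checks do:
 *  - zero magic on an all-zero page: end of WAL
 *  - any other wrong magic: corruption
 *  - page address lower than expected: leftover of a recycled segment,
 *    so end of WAL; higher than expected: corruption
 */
static PageVerdict
classify_page(const SketchPageHeader *hdr, bool page_all_zero,
			  uint64_t expected_addr)
{
	assert(hdr != NULL);

	if (hdr->magic == 0 && page_all_zero)
		return PAGE_END_OF_WAL;

	if (hdr->magic != SKETCH_PAGE_MAGIC)
		return PAGE_CORRUPT;

	if (hdr->pageaddr != expected_addr)
		return (hdr->pageaddr < expected_addr) ? PAGE_END_OF_WAL : PAGE_CORRUPT;

	return PAGE_OK;
}
```

In the patch itself the two end-of-WAL outcomes are signalled by setting state->EndOfWAL before returning false, and the caller then decides between the "reached end of WAL" LOG message and a WARNING.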

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 93f667b254..f891a62944 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1105,6 +1101,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL.  Make sure that the
+		 * rest of the page after this point is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1116,14 +1166,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1219,6 +1264,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1304,6 +1375,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..3f54c875e5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1626,7 +1626,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1735,7 +1735,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1789,11 +1789,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3024,6 +3031,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3045,6 +3053,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed.  After a
+				 * decoding error it points to the end of the failed record
+				 * instead, and for now we have no means of telling whether
+				 * EndRecPtr points to the beginning or to the end of the failed
+				 * record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3055,13 +3075,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3091,11 +3113,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3108,11 +3133,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3133,12 +3163,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3233,11 +3275,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3898,7 +3945,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ad383dbcaa..054a4cb127 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..26a4125b30 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..70d3b25eda 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046..bde16b7cfa 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1

#51

Andres Freund

andres@anarazel.de

about 3 years ago

In reply to: Kyotaro Horiguchi (#50)

Re: Make mesage at end-of-recovery less scary.

Hi,

On 2022-11-18 17:25:37 +0900, Kyotaro Horiguchi wrote:

Just rebased.

Fails with address sanitizer:
https://cirrus-ci.com/task/5632986241564672

Unfortunately one of the failures is in pg_waldump and we don't seem to
capture its output in 011_crash_recovery. So we don't see the nicely formatted
output...

[11:07:18.868] #0 0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912]
[11:07:18.912] Thread 1 (Thread 0x7fcf43662780 (LWP 39124)):
[11:07:18.912] #0 0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912] No symbol table info available.
[11:07:18.912] #1 0x00007fcf437ed537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912] No symbol table info available.
[11:07:18.912] #2 0x00007fcf43b8511b in __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155
[11:07:18.912] No locals.
[11:07:18.912] #3 0x00007fcf43b8fce8 in __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
[11:07:18.912] No locals.
[11:07:18.912] #4 0x00007fcf43b7244c in __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffd4fde18e6, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:186
[11:07:18.912] buffer_copy = {<__sanitizer::InternalMmapVectorNoCtor<char>> = {data_ = 0x7fcf40350000 '=' <repeats 65 times>, "\n==39124==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000002100 at pc 0x55c36c21e315 bp 0x7ffd4fde2550 sp 0x7ffd4fde2"..., capacity_bytes_ = 65536, size_ = <optimized out>}, <No data fields>}
...
[11:07:18.912] #6 0x00007fcf43b72788 in __asan::__asan_report_load1 (addr=<optimized out>) at ../../../../src/libsanitizer/asan/asan_rtl.cpp:117
[11:07:18.912] bp = 140725943412048
[11:07:18.912] pc = <optimized out>
[11:07:18.912] local_stack = 140528180793728
[11:07:18.912] sp = 140725943412040
[11:07:18.912] #7 0x000055c36c21e315 in ValidXLogRecordLength (state=state@entry=0x61a000000680, RecPtr=RecPtr@entry=33655480, record=record@entry=0x625000000bb8) at xlogreader.c:1126
[11:07:18.912] p = <optimized out>
[11:07:18.912] pe = 0x625000002100 ""
[11:07:18.912] #8 0x000055c36c21e3b1 in ValidXLogRecordHeader (state=state@entry=0x61a000000680, RecPtr=RecPtr@entry=33655480, PrevRecPtr=33655104, record=record@entry=0x625000000bb8, randAccess=randAccess@entry=false) at xlogreader.c:1169
[11:07:18.912] No locals.

The most important bit is "AddressSanitizer: heap-buffer-overflow on address
0x625000002100 at pc 0x55c36c21e315 bp 0x7ffd4fde2550 sp 0x7ffd4fde2"
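
The numbers in frame #7 can be checked by hand. The sketch below is only an
illustration, assuming the default 8 kB XLOG_BLCKSZ; SKETCH_BLCKSZ and both
helper names are made up here and are not PostgreSQL identifiers. It recomputes
the scan end "pe" the same way the patch does, from the record and RecPtr
values in the trace. The result equals the faulting address, which suggests
the read landed on the first byte past the page buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the default WAL block size of 8 kB. */
#define SKETCH_BLCKSZ UINT64_C(8192)

/*
 * Bytes from rec_ptr to the end of its WAL page, mirroring the patch's
 * "XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1))".
 */
static uint64_t
bytes_to_page_end(uint64_t rec_ptr)
{
	/* The mask trick only works for power-of-two block sizes. */
	assert((SKETCH_BLCKSZ & (SKETCH_BLCKSZ - 1)) == 0);

	return SKETCH_BLCKSZ - (rec_ptr & (SKETCH_BLCKSZ - 1));
}

/*
 * One-past-the-end address of the zero scan, given the in-memory address of
 * the record and its LSN.
 */
static uint64_t
scan_end_address(uint64_t record_addr, uint64_t rec_ptr)
{
	return record_addr + bytes_to_page_end(rec_ptr);
}
```

With the values from the trace, bytes_to_page_end(33655480) gives 5448, and
scan_end_address(0x625000000bb8, 33655480) gives 0x625000002100, which is both
"pe" and the address AddressSanitizer complains about.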

Greetings,

Andres Freund

#52

Justin Pryzby

pryzby@telsasoft.com

about 3 years ago

In reply to: Kyotaro Horiguchi (#50)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote:

+ while (*p == 0 && p < pe)
+ p++;

The bug reported by Andres/cfbot/ubsan is here.

Fixed in attached.

I didn't try to patch the test case to output the failing stderr, but
that might be good.

Attachments:

0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-diff; charset=us-ascii)

From 9bdf59ed0d78fff3f690584fc3c49c863d53f321 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 15 Nov 2022 13:41:46 +0900
Subject: [PATCH] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length", which suggests that something serious
has happened. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In turn,
raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 135 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 +++++++++++++++++
 6 files changed, 298 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 93f667b2544..137de967951 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1106,16 +1102,47 @@ XLogReaderInvalReadState(XLogReaderState *state)
 }
 
 /*
- * Validate an XLOG record header.
+ * Validate record length of an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
- * XLogReadRecord.  It's not intended for use from anywhere else.
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
  */
 static bool
-ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-					  XLogRecPtr PrevRecPtr, XLogRecord *record,
-					  bool randAccess)
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
 {
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -1124,6 +1151,24 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
+
+	return true;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord.  It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecPtr PrevRecPtr, XLogRecord *record,
+					  bool randAccess)
+{
+	if (!ValidXLogRecordLength(state, RecPtr, record))
+		return false;
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1219,6 +1264,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1304,6 +1375,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea6..3f54c875e5a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1626,7 +1626,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1735,7 +1735,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1789,11 +1789,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3024,6 +3031,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3045,6 +3053,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3055,13 +3075,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed- in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3091,11 +3113,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3108,11 +3133,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3133,12 +3163,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3233,11 +3275,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3898,7 +3945,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ad383dbcaa6..054a4cb127a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca58..26a4125b301 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316ae..70d3b25edaf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046d..bde16b7cfa7 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, $reached_eow_pat, $logstart));
+	sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +70,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+	last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+						 $logstart));
+	sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.25.1

#53

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 3 years ago

In reply to: Justin Pryzby (#52)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Tue, 22 Nov 2022 16:04:56 -0600, Justin Pryzby <pryzby@telsasoft.com> wrote in

On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote:

+ while (*p == 0 && p < pe)
+ p++;

The bug reported by Andres/cfbot/ubsan is here.

Fixed in attached.

Ur..ou..

-		while (*p == 0 && p < pe)
+		while (p < pe && *p == 0)

It was a one-byte overread: *p was dereferenced before the bounds check. Thanks!
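
To make the ordering issue concrete, here is a minimal standalone sketch of
the corrected scan. It is not code from the patch, and rest_is_zero is a
hypothetical helper name. Because && short-circuits, testing p < pe first
guarantees that *p is never evaluated once p has reached pe. With the operands
swapped, an all-zero buffer makes the loop read the byte at pe before noticing
it has run off the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true if buf[0 .. len) consists only of zero
 * bytes.  The bounds check comes first, so buf[len] is never read.
 */
static bool
rest_is_zero(const char *buf, size_t len)
{
	const char *p = buf;
	const char *pe;

	assert(buf != NULL);
	pe = buf + len;

	/* Check the bound before dereferencing, as in the fixed patch. */
	while (p < pe && *p == 0)
		p++;

	/* We reached the end only if every byte scanned was zero. */
	return p == pe;
}
```

Calling rest_is_zero on a zero-filled array with its exact size returns true
without touching memory past the array, which is the case that tripped
AddressSanitizer in the earlier version.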

I didn't try to patch the test case to output the failing stderr, but
that might be good.

I now use Cluster::wait_for_log() for the waits, but find_in_log() is
still there because it is needed to check that a message that should
not be logged is indeed absent.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v23-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From 37dd82652bf028811002af9e7c0df9e9d2ddb7d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 30 Nov 2022 11:51:46 +0900
Subject: [PATCH v23] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length", which suggests that something serious
has happened. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In turn,
raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 137 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl |  96 +++++++++++++++
 6 files changed, 289 insertions(+), 59 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 93f667b254..137de96795 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1105,6 +1101,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that the
+		 * rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1116,14 +1166,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1219,6 +1264,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1304,6 +1375,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 							  fname,
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 41ffc57da9..049b468620 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1625,7 +1625,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1734,7 +1734,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1788,11 +1788,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3020,6 +3027,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3041,6 +3049,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3051,13 +3071,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3087,11 +3109,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3104,11 +3129,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -3129,12 +3159,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3229,11 +3271,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3899,7 +3946,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ad383dbcaa..054a4cb127 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..26a4125b30 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1168,9 +1168,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..70d3b25eda 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 1b57d01046..8a379e8988 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -9,7 +9,9 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -47,7 +49,10 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+$node->wait_for_log($reached_eow_pat, $logstart);
+pass("end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -60,4 +65,95 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway corrupt the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$node->wait_for_log("WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+					$logstart);
+pass("header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1
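
Note on the page-header hunks above: the patch treats two page-level situations as end-of-WAL rather than corruption, namely a page that is entirely zeroed, and a page whose xlp_pageaddr is lower than expected (taken to be a leftover page in a recycled segment). The following is only a minimal standalone sketch of that classification, not the patch's code. The names SKETCH_BLCKSZ, SKETCH_PAGE_MAGIC, PageVerdict and classify_page are hypothetical, the magic value is an arbitrary nonzero placeholder, and the other header checks done by XLogReaderValidatePageHeader are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in PostgreSQL headers. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_PAGE_MAGIC 0xD110	/* arbitrary nonzero placeholder */

typedef enum
{
	PAGE_OK,
	PAGE_END_OF_WAL,
	PAGE_CORRUPT
} PageVerdict;

/* Return true if all len bytes at buf are zero. */
static bool
all_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		if (buf[i] != 0)
			return false;
	}
	return true;
}

/*
 * Classify a WAL page given its raw bytes plus the magic number and page
 * address already extracted from its header (an assumption made to keep
 * the sketch independent of the real header layout).
 */
static PageVerdict
classify_page(const unsigned char *page, uint16_t magic,
			  uint64_t pageaddr, uint64_t expected_addr)
{
	assert(page != NULL);

	/* A completely zeroed page is regarded as end-of-WAL. */
	if (magic == 0 && all_zero(page, SKETCH_BLCKSZ))
		return PAGE_END_OF_WAL;

	/* Zero magic on a non-empty page falls through to this check. */
	if (magic != SKETCH_PAGE_MAGIC)
		return PAGE_CORRUPT;

	/* An older-than-expected address suggests a recycled segment. */
	if (pageaddr != expected_addr)
		return (pageaddr < expected_addr) ? PAGE_END_OF_WAL : PAGE_CORRUPT;

	return PAGE_OK;
}
```

The design point the sketch tries to show is that only the "looks unused" cases are softened; any other mismatch still counts as a genuine failure.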

#54

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 3 years ago

In reply to: Kyotaro Horiguchi (#53)

Re: Make mesage at end-of-recovery less scary.

So this patch is now failing because it applies new tests to
011_crash_recovery.pl, which was removed recently. Can you please move
them elsewhere?

I think the comment for ValidXLogRecordLength should explain what the
return value is.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#55

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 3 years ago

In reply to: Alvaro Herrera (#54)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Thanks!

At Fri, 3 Feb 2023 15:16:02 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in

> So this patch is now failing because it applies new tests to
> 011_crash_recovery.pl, which was removed recently. Can you please move
> them elsewhere?

I couldn't find an appropriate existing file to move them to, so in the
end I created a new file named 034_recovery.pl. I also added a test for
standbys (which is the primary objective of this patch).

> I think the comment for ValidXLogRecordLength should explain what the
> return value is.

Agreed.

/*
  * Validate record length of an XLOG record header.
  *
  * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
  * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
  */
 static bool
 ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,

I added a similar description to ValidXLogRecordHeader.
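
To illustrate the return-value contract described above, here is a minimal standalone sketch of the three outcomes of the length check: a plausible length (the caller may go on reading), end-of-WAL (zero length and the rest of the page zeroed), and an invalid length. It is not the patch's code. SKETCH_BLCKSZ, SKETCH_MIN_RECORD_LEN, LengthVerdict and check_record_length are hypothetical names, and 24 is taken from the "wanted 24" in the log message quoted at the top of the thread as a stand-in for SizeOfXLogRecord.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192u			/* stand-in for XLOG_BLCKSZ */
#define SKETCH_MIN_RECORD_LEN 24u	/* stand-in for SizeOfXLogRecord */

typedef enum
{
	LEN_VALID,			/* caller may proceed to read the record */
	LEN_END_OF_WAL,		/* zero length and zeroed rest of the page */
	LEN_INVALID			/* too short, or zero length followed by garbage */
} LengthVerdict;

/*
 * Check the total-length field of the record starting at rec_off within
 * page.  The length is assumed to be the first 4 bytes of the record.
 */
static LengthVerdict
check_record_length(const unsigned char *page, uint32_t rec_off)
{
	uint32_t	tot_len;

	assert(page != NULL);
	assert(rec_off + sizeof(tot_len) <= SKETCH_BLCKSZ);

	memcpy(&tot_len, page + rec_off, sizeof(tot_len));

	if (tot_len == 0)
	{
		size_t		i = rec_off;

		/* Scan from the record start to the end of this page only. */
		while (i < SKETCH_BLCKSZ && page[i] == 0)
			i++;

		if (i == SKETCH_BLCKSZ)
			return LEN_END_OF_WAL;
	}

	if (tot_len < SKETCH_MIN_RECORD_LEN)
		return LEN_INVALID;

	return LEN_VALID;
}
```

As in the patch, only the current page is scanned, since later pages in a recycled segment are not expected to be zeroed.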

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v24-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From c58ca4d5db52c75dec9882d158d5724e12617005 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 30 Nov 2022 11:51:46 +0900
Subject: [PATCH v24] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead. In exchange, raise the error level for other kinds of WAL
failure to WARNING.  To make sure that the detection is correct, this
patch checks that all trailing bytes in the same page are zeroed in
that case.
---
 src/backend/access/transam/xlogreader.c   | 144 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  94 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/034_recovery.pl       | 135 ++++++++++++++++++++
 6 files changed, 335 insertions(+), 59 deletions(-)
 create mode 100644 src/test/recovery/t/034_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..8cb2d55333 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -640,25 +644,21 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Validate the record header.
 	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
-
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
+	 * Even though we use an XLogRecord pointer here, the whole record header
+	 * might not fit on this page.  If the whole record header is on this page,
+	 * validate it immediately.  Even otherwise xl_tot_len must be on this page
+	 * (it is the first field of MAXALIGNed records), but we still cannot
+	 * access any further fields until we've verified that we got the whole
+	 * header, so do just a basic sanity check on record length, and validate
+	 * the rest of the header after reading it from the next page.  The length
 	 * check is necessary here to ensure that we enter the "Need to reassemble
 	 * record" code path below; otherwise we might fail to apply
 	 * ValidXLogRecordHeader at all.
 	 */
+	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -668,18 +668,14 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1105,25 +1101,81 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that the
+		 * rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: wanted %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
  * This is just a convenience subroutine to avoid duplicated code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: wanted %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1219,6 +1271,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1308,6 +1386,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..3ada8b346d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1644,7 +1644,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1753,7 +1753,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1807,11 +1807,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3044,6 +3051,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3065,6 +3073,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3075,13 +3095,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3112,11 +3134,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3129,11 +3154,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3154,12 +3184,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3256,11 +3298,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3928,7 +3975,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..78b85c5a25 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 44b5c8726e..0f0cccd425 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1276,9 +1276,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d77bb2ab9b..84562d9de1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/034_recovery.pl b/src/test/recovery/t/034_recovery.pl
new file mode 100644
index 0000000000..580ae3b9f1
--- /dev/null
+++ b/src/test/recovery/t/034_recovery.pl
@@ -0,0 +1,135 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains for the same reason
+my $logstart = get_log_size($node);
+$node->start;
+$node->wait_for_log("WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+					$logstart);
+pass("header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = get_log_size($node);
+$node->start();
+$node->wait_for_log($reached_eow_pat, $logstart);
+
+# restart the standby
+$logstart = get_log_size($node_standby);
+$node_standby->start();
+$node_standby->wait_for_log($reached_eow_pat, $logstart);
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+	my ($node) = @_;
+
+	return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1

#56

Gregory Stark (as CFM)

stark.cfm@gmail.com

almost 3 years ago

In reply to: Kyotaro Horiguchi (#55)

Re: Make mesage at end-of-recovery less scary.

It looks like this needs a rebase, and at a quick glance it looks like
more than a trivial conflict. I'll mark it Waiting on Author. Please
update the status once it's rebased.

--
Gregory Stark
As Commitfest Manager

#57

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 3 years ago

In reply to: Gregory Stark (as CFM) (#56)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Mon, 6 Mar 2023 14:58:15 -0500, "Gregory Stark (as CFM)" <stark.cfm@gmail.com> wrote in

It looks like this needs a rebase, and at a quick glance it looks like
more than a trivial conflict. I'll mark it Waiting on Author. Please
update the status once it's rebased.

Thanks for checking it!

I think 4ac30ba4f2 is the cause of the conflict; it changes a few error
messages. In addition to rebasing, I rewrote some code comments in
xlogreader.c and revised the additional test script.
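
For readers skimming the thread, the end-of-WAL detection in this
version can be pictured roughly as below. This is a standalone sketch
under stated assumptions, not code from the patch: SKETCH_PAGE_SIZE
stands in for XLOG_BLCKSZ, and RecLenStatus and
classify_record_length() are hypothetical names invented for
illustration. The minimum length of 24 used in the usage note is the
value that appears in the log messages quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for XLOG_BLCKSZ; the value is an assumption for this sketch. */
#define SKETCH_PAGE_SIZE 8192

/* Hypothetical result codes, not part of PostgreSQL. */
typedef enum
{
	REC_OK,						/* length looks plausible, keep reading */
	REC_END_OF_WAL,				/* zero length and zeroed rest of page */
	REC_INVALID					/* too-short length, treat as corruption */
} RecLenStatus;

/*
 * Classify the total-length field found at "offset" within "page".
 *
 * A zero length followed only by zero bytes up to the end of the page is
 * taken as end of WAL.  Any other length below "min_len" is reported as
 * invalid.
 */
static RecLenStatus
classify_record_length(const unsigned char *page, size_t offset,
					   uint32_t min_len)
{
	uint32_t	tot_len;

	/* the length field itself must fit on the page */
	assert(offset + sizeof(tot_len) <= SKETCH_PAGE_SIZE);

	/* memcpy avoids any alignment assumption about the buffer */
	memcpy(&tot_len, page + offset, sizeof(tot_len));

	if (tot_len == 0)
	{
		size_t		i = offset;

		/* scan from the record start to the end of the page */
		while (i < SKETCH_PAGE_SIZE && page[i] == 0)
			i++;

		if (i == SKETCH_PAGE_SIZE)
			return REC_END_OF_WAL;
	}

	if (tot_len < min_len)
		return REC_INVALID;

	return REC_OK;
}
```

Calling classify_record_length(page, 128, 24) on an all-zero page
yields REC_END_OF_WAL, while the same call on a page that has a stray
non-zero byte after offset 128 yields REC_INVALID. The latter is the
situation the new test script creates by zeroing only xl_tot_len.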

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v25-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From d8b7ff96f48ede26390fa2208460ee2a8ea9cd87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v25] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead, and raise the error level for other kinds of WAL failure to
WARNING.  To make sure that the detection is correct, this patch checks
that all trailing bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 135 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++++++++++++++++++++
 6 files changed, 329 insertions(+), 51 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..5a27c10bbb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -641,16 +645,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -669,18 +669,21 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 *
+		 * XXX: more validation should be done here
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1106,25 +1109,81 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL. Make sure that
+		 * the rest of the page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: expected at least %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
  * This is just a convenience subroutine to avoid duplicated code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: expected at least %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1220,6 +1279,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1309,6 +1394,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..3ada8b346d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1644,7 +1644,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1753,7 +1753,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1807,11 +1807,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3044,6 +3051,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3065,6 +3073,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3075,13 +3095,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed- in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3112,11 +3134,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3129,11 +3154,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3154,12 +3184,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3256,11 +3298,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3928,7 +3975,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..78b85c5a25 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 44b5c8726e..0f0cccd425 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1276,9 +1276,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d77bb2ab9b..84562d9de1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/035_recovery.pl b/src/test/recovery/t/035_recovery.pl
new file mode 100644
index 0000000000..7107df6509
--- /dev/null
+++ b/src/test/recovery/t/035_recovery.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains for the same reason
+my $logstart = -s $node->logfile;
+$node->start;
+ok($node->wait_for_log(
+	   "WARNING:  invalid record length at 0/$lastlsn: expected at least 24, got 0",
+	   $logstart),
+   "header error is correctly logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = -s $node->logfile;
+$node->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'primary properly emits end-of-WAL message');
+
+# restart the standby
+$logstart = -s $node_standby->logfile;
+$node_standby->start();
+ok($node_standby->wait_for_log($reached_eow_pat, $logstart),
+   'standby properly emits end-of-WAL message');
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.31.1

#58

Aleksander Alekseev

aleksander@timescale.com

over 2 years ago

In reply to: Kyotaro Horiguchi (#57)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Hi,

Thanks for checking it!

I think 4ac30ba4f2 is the cause of the conflict; it changes a few error
messages. In addition to rebasing, I rewrote some code comments in
xlogreader.c and revised the additional test script.

Thanks for working on this; it has bugged me for a while. I noticed that
cfbot is not happy with the patch, so I rebased it. The
postgresql:pg_waldump test suite didn't pass after the rebase, so I
fixed that too. Other than that the patch LGTM, so I'm not changing its
status from "Ready for Committer".

It looks like the patch has been moved from one commitfest to the next
since 2020... If there is anything that would help get it merged into
PG17, please let me know.

--
Best regards,
Aleksander Alekseev

Attachments:

v26-0001-Make-End-Of-Recovery-error-less-scary.patch (application/octet-stream)

From b50d2b4abf89c6bb5e64934be1574f1863661080 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v26] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead, and raise the error level for other kinds of WAL failure to
WARNING.  To make sure that the detection is correct, this patch checks
that all trailing bytes in the same page are zeroed in that case.

Author: Kyotaro Horiguchi
Reviewed-by: Ashutosh Sharma, Justin Pryzby, Pavel Borisov, Aleksander Alekseev
Discussion: https://postgr.es/m/20200228.160100.2210969269596489579.horikyota.ntt@gmail.com
---
 src/backend/access/transam/xlogreader.c   | 133 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/bin/pg_waldump/t/001_basic.pl         |   4 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++++++++++++++++++++
 7 files changed, 330 insertions(+), 52 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..f38b392bf9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -641,16 +645,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -669,18 +669,21 @@ restart:
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 *
+		 * XXX: more validation should be done here
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -1103,16 +1106,51 @@ XLogReaderInvalReadState(XLogReaderState *state)
 }
 
 /*
- * Validate an XLOG record header.
+ * Validate record length of an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
- * XLogReadRecord.  It's not intended for use from anywhere else.
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
  */
 static bool
-ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-					  XLogRecPtr PrevRecPtr, XLogRecord *record,
-					  bool randAccess)
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
 {
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
 		report_invalid_record(state,
@@ -1121,6 +1159,27 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
 		return false;
 	}
+
+	return true;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
+ */
+static bool
+ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecPtr PrevRecPtr, XLogRecord *record,
+					  bool randAccess)
+{
+	if (!ValidXLogRecordLength(state, RecPtr, record))
+		return false;
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1216,6 +1275,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1305,6 +1390,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..4f0b1cd48f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1644,7 +1644,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1753,7 +1753,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1807,11 +1807,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3063,6 +3070,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3084,6 +3092,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3094,13 +3114,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed- in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3131,11 +3153,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3148,11 +3173,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3173,12 +3203,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3275,11 +3317,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3947,7 +3994,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index feff709435..5332a6d825 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -498,10 +498,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index e8b5a6cd61..766fd5c80d 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1305,9 +1305,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 029a0d0521..f6b1988b75 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -147,10 +147,10 @@ command_fails_like([ 'pg_waldump', $node->data_dir . '/pg_wal/' . $start_walfile
 command_like([ 'pg_waldump', $node->data_dir . '/pg_wal/' . $start_walfile, $node->data_dir . '/pg_wal/' . $end_walfile ], qr/./, 'runs with start and end segment specified');
 command_fails_like([ 'pg_waldump', '-p', $node->data_dir ], qr/error: no start WAL location given/, 'path option requires start location');
 command_like([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn, '--end', $end_lsn ], qr/./, 'runs with path option and start and end locations');
-command_fails_like([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn ], qr/error: error in WAL record at/, 'falling off the end of the WAL results in an error');
+command_checks_all([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn ], 0, [], [qr/empty record at/], 'falling off the end of the WAL results in an empty record error');
 
 command_like([ 'pg_waldump', '--quiet', $node->data_dir . '/pg_wal/' . $start_walfile ], qr/^$/, 'no output with --quiet option');
-command_fails_like([ 'pg_waldump', '--quiet', '-p', $node->data_dir, '--start', $start_lsn ], qr/error: error in WAL record at/, 'errors are shown with --quiet');
+command_checks_all([ 'pg_waldump', '--quiet', '-p', $node->data_dir, '--start', $start_lsn ], 0, [], [qr/empty record at/], 'empty record error is shown with --quiet');
 
 
 # Test for: Display a message that we're skipping data if `from`
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..9e5020adf5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/035_recovery.pl b/src/test/recovery/t/035_recovery.pl
new file mode 100644
index 0000000000..7107df6509
--- /dev/null
+++ b/src/test/recovery/t/035_recovery.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains for the same reason
+my $logstart = -s $node->logfile;
+$node->start;
+ok($node->wait_for_log(
+	   "WARNING:  invalid record length at 0/$lastlsn: expected at least 24, got 0",
+	   $logstart),
+   "header error is correctly logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = -s $node->logfile;
+$node->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'primary properly emits end-of-WAL message');
+
+# restart the standby
+$logstart = -s $node_standby->logfile;
+$node_standby->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'standby properly emits end-of-WAL message');
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.41.0

#59

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 2 years ago

In reply to: Aleksander Alekseev (#58)

Re: Make mesage at end-of-recovery less scary.

At Mon, 17 Jul 2023 15:20:30 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in

> Thanks for working on this, it bugged me for a while. I noticed that
> cfbot is not happy with the patch so I rebased it.
> postgresql:pg_waldump test suite didn't pass after the rebase. I fixed
> it too. Other than that the patch LGTM so I'm not changing its status
> from "Ready for Committer".

Thanks for the rebasing.

> It looks like the patch was moved between the commitfests since
> 2020... If there is anything that may help merging it into PG17 please
> let me know.

This might just be too much, or there might be some doubt about it.

This change basically treats a zero-length record as the normal end
of WAL.

I think the most controversial point in the design is the criterion
for an error condition. The assumption is that the WAL is sound if all
bytes following a complete record, up to the next page boundary, are
zeroed out. This is slightly narrower than the original criterion,
which merely checked that the next record is zero-length. Naturally,
there might be instances where that page has been blown out due to
device failure or some other reason. Despite this, I believe it is
preferable to always issuing a warning (at the LOG level, though)
about a potential WAL corruption.
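
To make the criterion above concrete, here is a minimal standalone sketch of
the classification, assuming an 8192-byte page. The names SKETCH_BLCKSZ,
sketch_verdict, tail_is_zeroed and classify_record_length are hypothetical and
exist only for this illustration; the patch itself does this inside
ValidXLogRecordLength() using XLOG_BLCKSZ and state->EndOfWAL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for this sketch (hypothetical stand-in for XLOG_BLCKSZ). */
#define SKETCH_BLCKSZ 8192

/* Hypothetical result type; the patch uses a bool plus state->EndOfWAL. */
typedef enum
{
	SKETCH_END_OF_WAL,			/* zero length and zeroed tail */
	SKETCH_INVALID_LENGTH,		/* too short, or zero length with a dirty tail */
	SKETCH_LENGTH_OK			/* length looks plausible */
} sketch_verdict;

/* Return true if every byte from in-page offset "off" to the page end is zero. */
static bool
tail_is_zeroed(const unsigned char *page, size_t off)
{
	for (size_t i = off; i < SKETCH_BLCKSZ; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/*
 * Classify the record starting at in-page offset "off".  "min_len" plays the
 * role of SizeOfXLogRecord.  The caller must ensure that the 4-byte length
 * field fits on the page (off + 4 <= SKETCH_BLCKSZ).
 */
static sketch_verdict
classify_record_length(const unsigned char *page, size_t off, uint32_t min_len)
{
	uint32_t	tot_len;

	/* The length field is the first field of the record header. */
	memcpy(&tot_len, page + off, sizeof(tot_len));

	/* A zero length counts as end-of-WAL only if the rest of the page is zero. */
	if (tot_len == 0 && tail_is_zeroed(page, off))
		return SKETCH_END_OF_WAL;

	if (tot_len < min_len)
		return SKETCH_INVALID_LENGTH;

	return SKETCH_LENGTH_OK;
}
```

With a fully zeroed page the sketch reports end-of-WAL; a single stray non-zero
byte after the record turns the same zero length into an invalid-length error,
which is the narrowing described above.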

I've adjusted the condition for muting repeated log messages at the
same LSN, changing it from ==LOG to <=WARNING. This is simply a
consequence of raising "real" warnings from LOG to WARNING. I believe
this is acceptable even without considering the aforementioned change,
as any single retriable (<ERROR) error at an LSN should be sufficient
to alert users about potential issues.
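
The muting rule can be sketched in isolation as follows. This is only an
illustration under stated assumptions: the SK_* levels are hypothetical
stand-ins ordered like the server's DEBUG1 < LOG < WARNING < ERROR, and
sketch_emode_for_record with its from_pg_wal flag is a simplified stand-in for
emode_for_corrupt_record() and its check of the WAL source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical severity levels; only their relative order matters here. */
enum
{
	SK_DEBUG1 = 1,
	SK_LOG = 2,
	SK_WARNING = 3,
	SK_ERROR = 4
};

/*
 * Return the level at which a complaint about "lsn" should be reported.
 * A repeated complaint at the same LSN is demoted to SK_DEBUG1, but only for
 * retriable levels (<= SK_WARNING) and only when reading from pg_wal.
 */
static int
sketch_emode_for_record(int emode, uint64_t lsn, bool from_pg_wal)
{
	/* LSN of the most recent complaint that was not muted. */
	static uint64_t last_complaint = 0;

	if (from_pg_wal && emode <= SK_WARNING)
	{
		if (lsn == last_complaint)
			emode = SK_DEBUG1;	/* already complained here, stay quiet */
		else
			last_complaint = lsn;	/* first complaint at this LSN */
	}

	return emode;
}
```

Under the old ==LOG condition, a message raised to WARNING would never be
muted and could repeat on every retry at the same LSN; comparing with
<=WARNING keeps the first complaint visible and demotes the repeats.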

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#60

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 2 years ago

In reply to: Kyotaro Horiguchi (#59)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

Anyway, this required rebasing, which is now done.

Thanks to John (Naylor) for pointing this out.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v27-0001-Make-End-Of-Recovery-error-less-scary.patch (text/x-patch; charset=us-ascii)

From e56f1f24523e3e562a4db166dfeaadc79fd7b27a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v27] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 134 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  94 +++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++++++++++++++++++++
 6 files changed, 327 insertions(+), 52 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..ce65f99c60 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -525,6 +528,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -608,16 +612,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -636,19 +636,19 @@ restart:
 	}
 	else
 	{
-		/* There may be no next page if it's too small. */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
-		/* We'll validate the header once we have the next page. */
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Try to find space to decode this record, if we can do so without
 	 * calling palloc.  If we can't, we'll try again below after we've
@@ -1091,25 +1091,81 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * The page after the record is completely zeroed. That suggests
+			 * we don't have a record after this point. We don't bother
+			 * checking the pages after since they are not zeroed in the case
+			 * of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: expected at least %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
  * This is just a convenience subroutine to avoid duplicated code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: expected at least %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1207,6 +1263,32 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			report_invalid_record(state,
+								  "empty page in log segment %s, offset %u",
+								  fname,
+								  offset);
+			state->EndOfWAL = true;
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1296,6 +1378,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..9db27b765f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1672,7 +1672,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1781,7 +1781,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1835,11 +1835,18 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
-				(errmsg("redo is not required")));
+				errmsg("redo is not required"));
+	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
 	}
 
 	/*
@@ -3091,6 +3098,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3112,6 +3120,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3122,13 +3142,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3159,11 +3181,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3176,11 +3201,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3201,12 +3231,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3303,11 +3345,16 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* promotion exit is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3975,7 +4022,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/* use currentSource as readSource is reset at failure */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2398167f49..2271304814 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -496,10 +496,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a3535bdfa9..11a69027d0 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1309,9 +1309,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0813722715..2be8ea7c37 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/035_recovery.pl b/src/test/recovery/t/035_recovery.pl
new file mode 100644
index 0000000000..7107df6509
--- /dev/null
+++ b/src/test/recovery/t/035_recovery.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains for the same reason
+my $logstart = -s $node->logfile;
+$node->start;
+ok($node->wait_for_log(
+	   "WARNING:  invalid record length at 0/$lastlsn: expected at least 24, got 0",
+	   $logstart),
+   "header error is correctly logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = -s $node->logfile;
+$node->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'primary properly emits end-of-WAL message');
+
+# restart the standby
+$logstart = -s $node_standby->logfile;
+$node_standby->start();
+ok($node_standby->wait_for_log($reached_eow_pat, $logstart),
+   'standby properly emits end-of-WAL message');
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
-- 
2.39.3

#61

vignesh C

vignesh21@gmail.com

about 2 years ago

In reply to: Kyotaro Horiguchi (#60)

Re: Make mesage at end-of-recovery less scary.

On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Anyway, this required rebasing, which is done.

A few tests are failing at [1]; kindly post an updated patch:
/tmp/cirrus-ci-build/src/test/recovery --testgroup recovery --testname
039_end_of_wal -- /usr/local/bin/perl -I
/tmp/cirrus-ci-build/src/test/perl -I
/tmp/cirrus-ci-build/src/test/recovery
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl
[23:53:10.370] ――――――――――――――――――――――――――――――――――――― ✀
―――――――――――――――――――――――――――――――――――――
[23:53:10.370] stderr:
[23:53:10.370] # Failed test 'xl_tot_len zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 267.
[23:53:10.370] # Failed test 'xlp_magic zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 340.
[23:53:10.370] # Failed test 'xlp_magic zero (split record header)'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 445.
[23:53:10.370] # Looks like you failed 3 tests of 14.
[23:53:10.370]
[23:53:10.370] (test program exited with status code 3)
[23:53:10.370] ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

[1]: https://cirrus-ci.com/task/5859293157654528

Regards,
Vignesh

#62

Kyotaro Horiguchi

horikyota.ntt@gmail.com

about 2 years ago

In reply to: vignesh C (#61)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Fri, 5 Jan 2024 16:02:24 +0530, vignesh C <vignesh21@gmail.com> wrote in

On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Anyway, this required rebasing, which is done.

A few tests are failing at [1]; kindly post an updated patch:

Thanks!

The errors occurred in a part of the tests for end-of-WAL detection
added in the master branch. These failures were primarily due to
changes in the message contents introduced by this patch. During the
revision, I discovered an issue with the handling of empty pages that
appear in the middle of reading continuation records. In the previous
version, such empty pages were mistakenly identified as indicating a
clean end-of-WAL (that is, reported at LOG). However, they should
actually be reported as a WARNING, since the record currently being
read is broken at the empty page. The following changes have been made
in this version:

1. Adjusting the test to align with the error message changes
introduced by this patch.

2. Adding tests for the newly added messages.

3. Correcting the handling of empty pages encountered during the
reading of continuation records. (XLogReaderValidatePageHeader)

4. Revising code comments.

5. Changing the term "log segment" to "WAL
segment". (XLogReaderValidatePageHeader)
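
To illustrate item 3, here is a minimal standalone sketch of the
decision, not the patch code itself. The names SKETCH_BLCKSZ,
PageVerdict, page_is_all_zero and classify_empty_page are hypothetical
and exist only for this example; the patch makes the same distinction
inside XLogReaderValidatePageHeader by comparing state->currRecPtr with
recptr and setting state->EndOfWAL. The block size of 8192 is an
assumption standing in for XLOG_BLCKSZ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for XLOG_BLCKSZ; 8192 is an assumed value for this sketch. */
#define SKETCH_BLCKSZ 8192

/* Hypothetical verdicts; the patch itself records this in state->EndOfWAL. */
typedef enum
{
	PAGE_NOT_EMPTY,				/* go on to the usual page header checks */
	PAGE_END_OF_WAL,			/* clean end of WAL, reported at LOG */
	PAGE_BROKEN_CONTRECORD		/* record cut off by an empty page, WARNING */
} PageVerdict;

/* Return true if every byte of the page is zero. */
static bool
page_is_all_zero(const unsigned char *page)
{
	for (size_t i = 0; i < SKETCH_BLCKSZ; i++)
	{
		if (page[i] != 0)
			return false;
	}
	return true;
}

/*
 * Classify a page that has just been read.
 *
 * rec_start is the LSN where the record being read begins, and page_addr
 * is the LSN of the page under inspection.  An all-zero page counts as
 * end-of-WAL only if the record starts on that very page; if the record
 * started on an earlier page, we are in the middle of a continuation
 * record and the record is broken.
 */
static PageVerdict
classify_empty_page(const unsigned char *page, uint64_t rec_start,
					uint64_t page_addr)
{
	assert(page != NULL);

	if (!page_is_all_zero(page))
		return PAGE_NOT_EMPTY;

	if (rec_start / SKETCH_BLCKSZ == page_addr / SKETCH_BLCKSZ)
		return PAGE_END_OF_WAL;

	return PAGE_BROKEN_CONTRECORD;
}
```

With this split, a zeroed page reached while starting a new record maps
to the LOG-level "reached end of WAL" message, while a zeroed page hit
while reading a continuation record maps to the WARNING path.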

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v28-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From 47cc39a212d7fd6857f30c35c76bcdd0d26bbc3f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v28] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 147 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  96 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++++++++++++++++++
 src/test/recovery/t/039_end_of_wal.pl     |  47 +++++--
 7 files changed, 380 insertions(+), 61 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..94861969eb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -553,6 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -636,16 +640,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -664,19 +664,19 @@ restart:
 	}
 	else
 	{
-		/* There may be no next page if it's too small. */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
-		/* We'll validate the header once we have the next page. */
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Try to find space to decode this record, if we can do so without
 	 * calling palloc.  If we can't, we'll try again below after we've
@@ -1119,25 +1119,80 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We have almost certainly reached the end of WAL; make sure that
+		 * the rest of the page after this point is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * Consider it as end-of-WAL if all subsequent bytes of this page
+			 * are zero. We don't bother checking the subsequent pages since
+			 * they are not zeroed in the case of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: expected at least %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
+ * This is just a convenience subroutine to avoid duplicate code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: expected at least %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1235,6 +1290,44 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			/*
+			 * Consider an empty page as end-of-WAL only when reading the first
+			 * part of a record.
+			 */
+			if (state->currRecPtr / XLOG_BLCKSZ == recptr / XLOG_BLCKSZ)
+			{
+				report_invalid_record(state,
+									  "empty page in WAL segment %s, offset %u",
+									  fname, offset);
+				state->EndOfWAL = true;
+			}
+			else
+				report_invalid_record(state,
+									  "empty page in WAL segment %s, offset %u while reading continuation record at %X/%X",
+									  fname, offset,
+									  LSN_FORMAT_ARGS(state->currRecPtr));
+
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1324,6 +1417,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..3b56bd04a6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1678,7 +1678,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1785,7 +1785,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1839,12 +1839,19 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -3095,6 +3102,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3116,6 +3124,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of a decoding error, it is at the end of the failed
+				 * record, but for now we have no means of knowing whether
+				 * EndRecPtr points to the beginning or the end of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3126,13 +3146,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In StandbyMode
+			 * that only happens if we have been triggered, so we shouldn't
+			 * loop anymore in that case. When EndOfWAL is true, we don't emit
+			 * the message immediately and instead will show it as a part of a
+			 * decent end-of-wal message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3163,11 +3185,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3180,11 +3205,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3205,12 +3235,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3307,11 +3349,17 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				/* make sure we didn't exit standby mode without trigger */
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* exit by promotion is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3979,7 +4027,11 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/*
+	 * readSource cannot be used in place of currentSource because readSource
+	 * is reset on failure
+	 */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e00395ff2b..fd6b1bf7e1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -497,10 +497,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1f9403fc5c..1c45a25313 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1309,9 +1309,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2e9e5f43eb..7f879281f5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/035_recovery.pl b/src/test/recovery/t/035_recovery.pl
new file mode 100644
index 0000000000..7107df6509
--- /dev/null
+++ b/src/test/recovery/t/035_recovery.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains for the same reason
+my $logstart = -s $node->logfile;
+$node->start;
+ok($node->wait_for_log(
+	   "WARNING:  invalid record length at 0/$lastlsn: expected at least 24, got 0",
+	   $logstart),
+   "header error is correctly logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = -s $node->logfile;
+$node->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'primary properly emits end-of-WAL message');
+
+# restart the standby
+$logstart = -s $node_standby->logfile;
+$node_standby->start();
+ok($node_standby->wait_for_log($reached_eow_pat, $logstart),
+   'standby properly emits end-of-WAL message');
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d..5a17f68512 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -258,16 +258,29 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###########################################################################
 
-# xl_tot_len is 0 (a common case, we hit trailing zeroes).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+# empty record without trailing garbage bytes until the page end - not error
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
+ok( $node->log_contains(
+		"LOG:  reached end of WAL at.*\n.* DETAIL:  empty record at",
+		$log_size),
+	"end-of-WAL by empty record");
+
+# xl_tot_len is 0 with following garbage bytes in the page
+emit_message($node, 0);
+$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->stop('immediate');
+write_wal($node, $TLI,
+		  # last byte in the page at $end_lsn
+		  $end_lsn - ($end_lsn % $WAL_BLOCK_SIZE) + $WAL_BLOCK_SIZE - 1,
+		  pack("c", 1)); # garbage byte
+$log_size = -s $node->logfile;
+$node->start;
 ok( $node->log_contains(
 		"invalid record length at .*: expected at least 24, got 0", $log_size
 	),
-	"xl_tot_len zero");
+	"zero xl_tot_len followed by garbage bytes");
 
 # xl_tot_len is < 24 (presumably recycled garbage).
 emit_message($node, 0);
@@ -328,7 +341,7 @@ note "Multi-page end-of-WAL detection, header is not split";
 # This series of tests requires a valid xl_prev set in the record header
 # written to WAL.
 
-# Good xl_prev, we hit zero page next (zero magic).
+# Good xl_prev, we hit zero page next
 emit_message($node, 0);
 $prev_lsn = advance_out_of_record_splitting_zone($node);
 $end_lsn = emit_message($node, 0);
@@ -337,8 +350,24 @@ write_wal($node, $TLI, $end_lsn,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
 $log_size = -s $node->logfile;
 $node->start;
-ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
-	"xlp_magic zero");
+ok( $node->log_contains("WARNING:  empty page in WAL segment .*, offset .* while reading continuation record at .*", $log_size),
+   "empty page");
+
+# Good xl_prev, we hit zero page magic with following garbage bytes.
+emit_message($node, 0);
+$prev_lsn = advance_out_of_record_splitting_zone($node);
+$end_lsn = emit_message($node, 0);
+$node->stop('immediate');
+write_wal($node, $TLI, $end_lsn,
+		  build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
+# place garbage at the end of the next page
+write_wal($node, $TLI,
+		  start_of_next_page(start_of_next_page($end_lsn)) - 1,
+		  pack("i", 1));
+$log_size = -s $node->logfile;
+$node->start;
+ok( $node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
+   "bad magic");
 
 # Good xl_prev, we hit garbage page next (bad magic).
 emit_message($node, 0);
@@ -442,8 +471,8 @@ write_wal($node, $TLI, $end_lsn,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
-ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
-	"xlp_magic zero (split record header)");
+ok( $node->log_contains("WARNING:  empty page in WAL segment .*, offset .* while reading continuation record at .*", $log_size),
+	"zero page while reading a record (split record header)");
 
 # And we'll also check xlp_pageaddr before any header checks.
 emit_message($node, 0);
-- 
2.39.3

#63

Aleksander Alekseev

aleksander@timescale.com

almost 2 years ago

In reply to: Kyotaro Horiguchi (#62)

Re: Make mesage at end-of-recovery less scary.

Hi,

The errors occurred in a part of the tests for end-of-WAL detection
added in the master branch. These failures were primarily due to
changes in the message contents introduced by this patch. During the
revision, I discovered an issue with the handling of empty pages that
appear in the middle of reading continuation records. In the previous
version, such empty pages were mistakenly identified as indicating a
clean end-of-WAL (that is a LOG). However, they should actually be
handled as a WARNING, since the record currently being read is broken
at the empty pages. The following changes have been made in this
version:

1. Adjusting the test to align with the error message changes
introduced by this patch.

2. Adding tests for the newly added messages.

3. Correcting the handling of empty pages encountered during the
reading of continuation records. (XLogReaderValidatePageHeader)

4. Revising code comments.

5. Changing the term "log segment" to "WAL
segment". (XLogReaderValidatePageHeader)
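
As a rough illustration of item 3 (a sketch, not the patch's code): the distinction comes down to whether the empty page is the page on which the current record starts. The names `SKETCH_BLCKSZ`, `EmptyPageVerdict` and `classify_empty_page` are hypothetical, and the 8192-byte page size is an assumption; only the page-number comparison mirrors the XLogReaderValidatePageHeader hunk in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed WAL page size, for illustration */

typedef enum
{
	EMPTY_PAGE_END_OF_WAL,		/* record starts on this page: clean end (LOG) */
	EMPTY_PAGE_BROKEN_RECORD	/* page interrupts a record: report a WARNING */
} EmptyPageVerdict;

/*
 * curr_rec_ptr: start LSN of the record being read.
 * page_ptr:     start LSN of the all-zero page that was just read.
 */
static EmptyPageVerdict
classify_empty_page(uint64_t curr_rec_ptr, uint64_t page_ptr)
{
	/* this sketch assumes the page pointer is page-aligned */
	assert(page_ptr % SKETCH_BLCKSZ == 0);

	/* same page number: we were reading the first part of a record */
	if (curr_rec_ptr / SKETCH_BLCKSZ == page_ptr / SKETCH_BLCKSZ)
		return EMPTY_PAGE_END_OF_WAL;

	/* otherwise the record began on an earlier page and is cut short */
	return EMPTY_PAGE_BROKEN_RECORD;
}
```

A record that starts on an earlier page and runs into an all-zero page falls into the second case, which is the situation the previous version misreported as a clean end-of-WAL.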

regards.

Thanks for the updated patch.

```
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
```

Just as a random thought: perhaps we should make this a separate
function, as a part of src/port/. It seems to me that this code could
benefit from using vector instructions some day, similarly to
memcmp(), memset() etc. Surprisingly there doesn't seem to be a
standard C function for this. Alternatively, one could argue that a
single loop isn't much code to reuse and that the C compiler will
emit SIMD instructions for us. However, a counter-counter argument
would be that we could use a macro or, even better, an inline function
and get the same effect with slightly more readable code.
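
For illustration only, such an inline helper might look like the sketch below. The names `sketch_mem_is_all_zero` and `sketch_bytes_to_page_end` are hypothetical and do not come from the patch or from src/port/, and the 8192-byte page size is an assumption. The second function reproduces the `XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1))` arithmetic from the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed WAL page size (a power of two) */

/* Return true if the len bytes starting at ptr are all zero. */
static inline bool
sketch_mem_is_all_zero(const void *ptr, size_t len)
{
	const unsigned char *p = ptr;
	const unsigned char *end;

	assert(ptr != NULL || len == 0);
	end = p + len;

	/* plain byte loop; a compiler may or may not vectorize this */
	while (p < end && *p == 0)
		p++;

	return p == end;
}

/* Number of bytes from rec_ptr up to the end of the page containing it. */
static inline size_t
sketch_bytes_to_page_end(uint64_t rec_ptr)
{
	return SKETCH_BLCKSZ - (size_t) (rec_ptr & (SKETCH_BLCKSZ - 1));
}
```

With these, the check in the patch would reduce to something like `sketch_mem_is_all_zero(record, sketch_bytes_to_page_end(RecPtr))`. For a record pointer of 0x1551048 and 8192-byte pages, the second helper yields 4024 bytes.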

```
- * This is just a convenience subroutine to avoid duplicated code in
+ * This is just a convenience subroutine to avoid duplicate code in
```

This change doesn't seem to be related to the patch. Personally I
don't mind it though.

All in all I find v28 somewhat scary. It does much more than "making
one message less scary", which was the original intent and the thing
that bugged me personally, and accordingly it touches many more
places, including xlogreader.c, xlogrecovery.c, etc.

In particular, I have mixed feelings about this:

```
+            /*
+             * Consider it as end-of-WAL if all subsequent bytes of this page
+             * are zero. We don't bother checking the subsequent pages since
+             * they are not zeroed in the case of recycled segments.
+             */
```

If I understand correctly, if somehow several FS blocks end up being
zeroed (due to an OS bug, bit rot, restoring from a backup that is
corrupted for whatever reason, hardware failures, ...) there is a
non-zero chance that PG will interpret this as a normal situation. To
my knowledge this is not what we typically do - typically PG would
report an error and ask a human to figure out what happened. Of course
the probability of such a scenario is small, but I don't think that as
DBMS developers we can ignore it.
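
To make the concern concrete, here is a simplified sketch of the length check as described in the patch. It is not the patch's code: `sketch_check_record_length` and the `RecVerdict` values are hypothetical names, the page size is assumed to be 8192 bytes, and the minimum record length of 24 is taken from the log messages in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192			/* assumed WAL page size */
#define SKETCH_MIN_RECORD_LEN 24	/* "expected at least 24" in the logs */

typedef enum
{
	REC_LOOKS_VALID,			/* length is plausible, keep validating */
	REC_END_OF_WAL,				/* zero length and zero tail: treated as clean end */
	REC_INVALID_LENGTH			/* anything else that is too short */
} RecVerdict;

static RecVerdict
sketch_check_record_length(const unsigned char *page, size_t offset)
{
	uint32_t	tot_len;
	size_t		i;

	assert(offset + sizeof(tot_len) <= SKETCH_BLCKSZ);

	/* the length is the first field of the record header */
	memcpy(&tot_len, page + offset, sizeof(tot_len));

	if (tot_len == 0)
	{
		/* scan the rest of this page only, as the patch comment says */
		for (i = offset; i < SKETCH_BLCKSZ && page[i] == 0; i++)
			;
		if (i == SKETCH_BLCKSZ)
			return REC_END_OF_WAL;
	}

	if (tot_len < SKETCH_MIN_RECORD_LEN)
		return REC_INVALID_LENGTH;

	return REC_LOOKS_VALID;
}
```

In this sketch, a page whose tail was zeroed by a storage fault produces the same `REC_END_OF_WAL` verdict as a page that was simply never written past that point, because the check looks only at the bytes that are there.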

Does anyone agree, or am I making things up?

--
Best regards,
Aleksander Alekseev

#64

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Aleksander Alekseev (#63)

Re: Make mesage at end-of-recovery less scary.

Thank you for the comments.

At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in

```
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
```
Just as a random thought: perhaps we should make this a separate
function, as a part of src/port/. It seems to me that this code could
benefit from using vector instructions some day, similarly to
memcmp(), memset() etc. Surprisingly there doesn't seem to be a
standard C function for this. Alternatively, one could argue that a
single loop isn't much code to reuse and that the C compiler will
emit SIMD instructions for us. However, a counter-counter argument
would be that we could use a macro or, even better, an inline function
and get the same effect with slightly more readable code.

Creating a function with a name like memcmp_byte() should be
straightforward, but implementing it with SIMD right away seems a bit
challenging. Similar operations are already being performed elsewhere
in the code, probably within the stats collector, where memcmp is used
with a statically allocated area that's filled with zeros. If we can
achieve performance equivalent to memcmp with this new function,
then it definitely seems worth pursuing.
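
A minimal sketch of the memcmp-against-a-zero-buffer approach mentioned above, for comparison with the byte loop. This is not code from PostgreSQL; `sketch_zero_page` and `sketch_page_tail_is_zero` are hypothetical names, and the 8192-byte page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192		/* assumed WAL page size */

/* statically allocated, zero-filled reference area */
static const unsigned char sketch_zero_page[SKETCH_BLCKSZ] = {0};

/* Return true if page[offset .. SKETCH_BLCKSZ - 1] is all zero. */
static inline bool
sketch_page_tail_is_zero(const unsigned char *page, size_t offset)
{
	assert(offset <= SKETCH_BLCKSZ);

	/* let the C library's memcmp do the comparison */
	return memcmp(page + offset, sketch_zero_page,
				  SKETCH_BLCKSZ - offset) == 0;
}
```

This trades a page-sized static buffer for whatever optimizations the platform's memcmp provides; whether it actually beats the simple loop would have to be measured.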

```
- * This is just a convenience subroutine to avoid duplicated code in
+ * This is just a convenience subroutine to avoid duplicate code in
```
This change doesn't seem to be related to the patch. Personally I
don't mind it though.

Ah, I'm sorry. I mistook that line for something I had written myself
and modified it at the last moment.

All in all I find v28 somewhat scary. It does much more than "making
one message less scary", which was the original intent and the thing
that bugged me personally, and accordingly it touches many more
places, including xlogreader.c, xlogrecovery.c, etc.

In particular, I have mixed feelings about this:
```
+            /*
+             * Consider it as end-of-WAL if all subsequent bytes of this page
+             * are zero. We don't bother checking the subsequent pages since
+             * they are not zeroed in the case of recycled segments.
+             */
```
If I understand correctly, if somehow several FS blocks end up being
zeroed (due to an OS bug, bit rot, restoring from a backup that is
corrupted for whatever reason, hardware failures, ...) there is a
non-zero chance that PG will interpret this as a normal situation. To
my knowledge this is not what we typically do - typically PG would
report an error and ask a human to figure out what happened. Of course
the probability of such a scenario is small, but I don't think that as
DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

Does anyone agree, or am I making things up?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#65

Aleksander Alekseev

aleksander@timescale.com

almost 2 years ago

In reply to: Kyotaro Horiguchi (#64)

Re: Make mesage at end-of-recovery less scary.

Hi,

If I understand correctly, if somehow several FS blocks end up being
zeroed (due to an OS bug, bit rot, restoring from a backup that is
corrupted for whatever reason, hardware failures, ...) there is a
non-zero chance that PG will interpret this as a normal situation. To
my knowledge this is not what we typically do - typically PG would
report an error and ask a human to figure out what happened. Of course
the probability of such a scenario is small, but I don't think that as
DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

I agree and don't mind changing the error message per se.

However, I see that the actual logic of how WAL is processed is being
changed. If we do this, at the very least it requires careful thought. I
strongly suspect that the proposed code is wrong and/or not safe
and/or less safe than it is now for the reasons named above.

--
Best regards,
Aleksander Alekseev

#66

Michael Paquier

michael@paquier.xyz

almost 2 years ago

In reply to: Aleksander Alekseev (#65)

Re: Make mesage at end-of-recovery less scary.

On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

I agree and don't mind changing the error message per se.

However, I see that the actual logic of how WAL is processed is being
changed. If we do this, at the very least it requires careful thought. I
strongly suspect that the proposed code is wrong and/or not safe
and/or less safe than it is now for the reasons named above.

FWIW, that pretty much sums up my feeling regarding this patch,
because an error, basically any error, would come back to hurt us very
badly. Sure, the error messages we generate now when reaching the end
of WAL can sound scary, and they are (I suspect that's not really the
case for anybody who has a history of doing support with PostgreSQL,
because a bunch of these messages are old enough to vote, but I can
understand that anybody would freak out the first time they see that).

However, per the recent issues we've had in this area, like
cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
bae868caf222, I am of the opinion that the header validation, the
empty page case in XLogReaderValidatePageHeader() and the record read
changes are risky enough that I am not convinced that the gains are
worth the risks taken.

The error stack in the WAL reader is complicated enough that making it
more complicated as the patch proposes does not sound like a good
tradeoff to me to make the reports related to the end of WAL cleaner
for the end-user. I agree that we should do something, but the patch
does not seem like a good step towards this goal. Perhaps somebody
would be more excited about this proposal than I am, of course.
--
Michael

#67

Peter Smith

smithpb2250@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#66)

Re: Make mesage at end-of-recovery less scary.

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/2490/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Kind Regards,
Peter Smith.

#68

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Peter Smith (#67)

Re: Make mesage at end-of-recovery less scary.

At Mon, 22 Jan 2024 16:09:28 +1100, Peter Smith <smithpb2250@gmail.com> wrote in

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/2490/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Thanks for noticing that. I will post a new version.
regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#69

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#66)

1 attachment(s)

Re: Make mesage at end-of-recovery less scary.

At Wed, 17 Jan 2024 14:32:00 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

I agree and don't mind changing the error message per se.

However, I see that the actual logic of how WAL is processed is being
changed. If we do this, at the very least it requires careful thought. I
strongly suspect that the proposed code is wrong and/or not safe
and/or less safe than it is now for the reasons named above.

FWIW, that pretty much sums up my feeling regarding this patch,
because an error, basically any error, would come back to hurt us very
badly. Sure, the error messages we generate now when reaching the end
of WAL can sound scary, and they are (I suspect that's not really the
case for anybody who has a history of doing support with PostgreSQL,
because a bunch of these messages are old enough to vote, but I can
understand that anybody would freak out the first time they see that).

However, per the recent issues we've had in this area, like
cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
bae868caf222, I am of the opinion that the header validation, the
empty page case in XLogReaderValidatePageHeader() and the record read
changes are risky enough that I am not convinced that the gains are
worth the risks taken.

The error stack in the WAL reader is complicated enough that making it
more complicated as the patch proposes does not sound like a good
tradeoff to me to make the reports related to the end of WAL cleaner
for the end-user. I agree that we should do something, but the patch
does not seem like a good step towards this goal. Perhaps somebody
would be more excited about this proposal than I am, of course.

Thank you both for the comments. The criticism seems valid. The
approach to identifying the end-of-WAL state in this patch is quite
heuristic, and its validity or safety can certainly be contested. On
the other hand, if we seek perfection in this area of judgment, we may
need to make the WAL format itself more robust. In any case, since the
majority of the feedback on this patch seems to be negative, I am
going to withdraw it if no supportive opinions emerge during this
commit-fest.

The attached patch addresses the errors reported by CF-bot.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v29-0001-Make-End-Of-Recovery-error-less-scary.patchtext/x-patch; charset=us-asciiDownload

From 933d10fa6c7b71e4684f5ba38e85177afaa56f58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v29] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 147 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  96 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/bin/pg_waldump/t/001_basic.pl         |   5 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++++++++++++++++++
 src/test/recovery/t/039_end_of_wal.pl     |  47 +++++--
 8 files changed, 383 insertions(+), 63 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..94861969eb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+								  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -553,6 +556,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -636,16 +640,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -664,19 +664,19 @@ restart:
 	}
 	else
 	{
-		/* There may be no next page if it's too small. */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
-		/* We'll validate the header once we have the next page. */
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Try to find space to decode this record, if we can do so without
 	 * calling palloc.  If we can't, we'll try again below after we've
@@ -1119,25 +1119,80 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+					  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the record is filled with zeroes.
+		 */
+		p = (char *) record;
+		pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+		while (p < pe && *p == 0)
+			p++;
+
+		if (p == pe)
+		{
+			/*
+			 * Consider it as end-of-WAL if all subsequent bytes of this page
+			 * are zero. We don't bother checking the subsequent pages since
+			 * they are not zeroed in the case of recycled segments.
+			 */
+			report_invalid_record(state, "empty record at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
+
+			/* notify end-of-wal to callers */
+			state->EndOfWAL = true;
+			return false;
+		}
+	}
+
+	if (record->xl_tot_len < SizeOfXLogRecord)
+	{
+		report_invalid_record(state,
+							  "invalid record length at %X/%X: expected at least %u, got %u",
+							  LSN_FORMAT_ARGS(RecPtr),
+							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
- * This is just a convenience subroutine to avoid duplicated code in
+ * This is just a convenience subroutine to avoid duplicate code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * Returns true if the header fields have the valid values and the caller can
+ * proceed reading to the following part of the record.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
-	{
-		report_invalid_record(state,
-							  "invalid record length at %X/%X: expected at least %u, got %u",
-							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+	if (!ValidXLogRecordLength(state, RecPtr, record))
 		return false;
-	}
+
 	if (!RmgrIdIsValid(record->xl_rmid))
 	{
 		report_invalid_record(state,
@@ -1235,6 +1290,44 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
 	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
+	StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+	if (hdr->xlp_magic == 0)
+	{
+		/* Regard an empty page as End-Of-WAL */
+		int			i;
+
+		for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+		if (i == XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			XLogFileName(fname, state->seg.ws_tli, segno,
+						 state->segcxt.ws_segsize);
+
+			/*
+			 * Consider an empty page as end-of-WAL only when reading the first
+			 * part of a record.
+			 */
+			if (state->currRecPtr / XLOG_BLCKSZ == recptr / XLOG_BLCKSZ)
+			{
+				report_invalid_record(state,
+									  "empty page in WAL segment %s, offset %u",
+									  fname, offset);
+				state->EndOfWAL = true;
+			}
+			else
+				report_invalid_record(state,
+									  "empty page in WAL segment %s, offset %u while reading continuation record at %X/%X",
+									  fname, offset,
+									  LSN_FORMAT_ARGS(state->currRecPtr));
+				
+			return false;
+		}
+
+		/* The same condition will be caught as invalid magic number */
+	}
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		char		fname[MAXFNAMELEN];
@@ -1324,6 +1417,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 							  fname,
 							  LSN_FORMAT_ARGS(recptr),
 							  offset);
+
+		/*
+		 * If the page address is less than expected we assume it is an unused
+		 * page in a recycled segment.
+		 */
+		if (hdr->xlp_pageaddr < recptr)
+			state->EndOfWAL = true;
+
 		return false;
 	}
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..3b56bd04a6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1678,7 +1678,7 @@ PerformWalRecovery(void)
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1785,7 +1785,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, WARNING, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1839,12 +1839,19 @@ PerformWalRecovery(void)
 
 		InRedo = false;
 	}
-	else
+	else if (xlogreader->EndOfWAL)
 	{
 		/* there are no WAL records following the checkpoint */
 		ereport(LOG,
 				(errmsg("redo is not required")));
 	}
+	else
+	{
+		/* broken record found */
+		ereport(WARNING,
+				errmsg("redo is skipped"),
+				errhint("This suggests WAL file corruption. You might need to check the database."));
+	}
 
 	/*
 	 * This check is intentionally after the above log messages that indicate
@@ -3095,6 +3102,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
@@ -3116,6 +3124,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;
 				missingContrecPtr = xlogreader->missingContrecPtr;
+				ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+				/*
+				 * EndRecPtr is the LSN we tried to read but failed. In the
+				 * case of decoding error, it is at the end of the failed
+				 * record but we don't have a means for now to know EndRecPtr
+				 * is pointing to which of the beginning or ending of the
+				 * failed record.
+				 */
+				ErrRecPtr = xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -3126,13 +3146,15 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 
 			/*
 			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * failed - in that case we already logged something. In
+			 * StandbyMode that only happens if we have been triggered, so we
+			 * shouldn't loop anymore in that case. When EndOfWAL is true, we
+			 * don't emit the message immediately and instead will show it as
+			 * a part of a decent end-of-WAL message later.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			if (!xlogreader->EndOfWAL && errormsg)
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
+						errmsg_internal("%s", errormsg) /* already translated */ );
 		}
 
 		/*
@@ -3163,11 +3185,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -3180,11 +3205,16 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here.
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+										LSN_FORMAT_ARGS(ErrRecPtr),
+										replayTLI));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					EnableStandbyMode();
@@ -3205,12 +3235,24 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 * Recovery ended.
+			 *
+			 * Emit a decent message if we met end-of-WAL. Otherwise we should
+			 * have already emitted an error message.
+			 */
+			if (xlogreader->EndOfWAL)
+				ereport(LOG,
+						errmsg("reached end of WAL at %X/%X on timeline %u",
+							   LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+						(errormsg ? errdetail_internal("%s", errormsg) : 0));
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
@@ -3307,11 +3349,17 @@ retry:
 			case XLREAD_WOULDBLOCK:
 				return XLREAD_WOULDBLOCK;
 			case XLREAD_FAIL:
+				/* make sure we didn't exit standby mode without trigger */
+				Assert(!StandbyMode || CheckForStandbyTrigger());
+
 				if (readFile >= 0)
 					close(readFile);
 				readFile = -1;
 				readLen = 0;
 				readSource = XLOG_FROM_ANY;
+
+				/* exit by promotion is not end-of-WAL */
+				xlogreader->EndOfWAL = !StandbyMode;
 				return XLREAD_FAIL;
 			case XLREAD_SUCCESS:
 				break;
@@ -3979,7 +4027,11 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
 	static XLogRecPtr lastComplaint = 0;
 
-	if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+	/*
+	 * readSource cannot be used in place of currentSource because readSource
+	 * is reset on failure
+	 */
+	if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
 	{
 		if (RecPtr == lastComplaint)
 			emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 728059518e..c69e33dbe7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -497,10 +497,9 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
-											   startpointTLI,
-											   LSN_FORMAT_ARGS(LogstreamResult.Write))));
+									errmsg("replication terminated by primary server at %X/%X on timeline %u",
+										   LSN_FORMAT_ARGS(LogstreamResult.Write),
+										   startpointTLI));
 							endofwal = true;
 							break;
 						}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1f9403fc5c..1c45a25313 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1309,9 +1309,16 @@ main(int argc, char **argv)
 		exit(0);
 
 	if (errormsg)
-		pg_fatal("error in WAL record at %X/%X: %s",
-				 LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-				 errormsg);
+	{
+		if (xlogreader_state->EndOfWAL)
+			pg_log_info("end of WAL at %X/%X: %s",
+						LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+						errormsg);
+		else
+			pg_fatal("error in WAL record at %X/%X: %s",
+					 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+					 errormsg);
+	}
 
 	XLogReaderFree(xlogreader_state);
 
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 082cb8a589..b15e7a2ca3 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -147,10 +147,11 @@ command_fails_like([ 'pg_waldump', $node->data_dir . '/pg_wal/' . $start_walfile
 command_like([ 'pg_waldump', $node->data_dir . '/pg_wal/' . $start_walfile, $node->data_dir . '/pg_wal/' . $end_walfile ], qr/./, 'runs with start and end segment specified');
 command_fails_like([ 'pg_waldump', '-p', $node->data_dir ], qr/error: no start WAL location given/, 'path option requires start location');
 command_like([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn, '--end', $end_lsn ], qr/./, 'runs with path option and start and end locations');
-command_fails_like([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn ], qr/error: error in WAL record at/, 'falling off the end of the WAL results in an error');
+
+command_checks_all([ 'pg_waldump', '-p', $node->data_dir, '--start', $start_lsn ], 0, [qr/./], [qr/pg_waldump: end of WAL at.*: empty record at/], 'the end of the WAL doesn\'t result in an error');
 
 command_like([ 'pg_waldump', '--quiet', $node->data_dir . '/pg_wal/' . $start_walfile ], qr/^$/, 'no output with --quiet option');
-command_fails_like([ 'pg_waldump', '--quiet', '-p', $node->data_dir, '--start', $start_lsn ], qr/error: error in WAL record at/, 'errors are shown with --quiet');
+command_checks_all([ 'pg_waldump', '--quiet', '-p', $node->data_dir, '--start', $start_lsn ], 0, [qr/^$/], [qr/pg_waldump: end of WAL at .*: empty record at .*/], 'end-status is shown with --quiet');
 
 
 # Test for: Display a message that we're skipping data if `from`
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2e9e5f43eb..7f879281f5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,7 @@ struct XLogReaderState
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	bool		EndOfWAL;		/* was the last attempt EOW? */
 
 	/*
 	 * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/035_recovery.pl b/src/test/recovery/t/035_recovery.pl
new file mode 100644
index 0000000000..7107df6509
--- /dev/null
+++ b/src/test/recovery/t/035_recovery.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Minimal test testing recovery process
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use IPC::Run;
+
+my $reached_eow_pat = "reached end of WAL at ";
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init(allows_streaming => 1);
+$node->start;
+
+my ($stdout, $stderr) = ('', '');
+
+my $segsize = $node->safe_psql('postgres',
+	   qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+				 SELECT pg_switch_wal();
+				 CHECKPOINT;
+				 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+# run pg_controldata with an argument list, so no shell quoting is needed
+my $cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+	$lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+	"LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # zero xl_tot_len, leaving following bytes alone.
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /error: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# the server also complains for the same reason
+my $logstart = -s $node->logfile;
+$node->start;
+ok($node->wait_for_log(
+	   "WARNING:  invalid record length at 0/$lastlsn: expected at least 24, got 0",
+	   $logstart),
+   "header error is correctly logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$node_standby->start;
+$node->safe_psql('postgres', 'CREATE TABLE t ()');
+my $primary_lsn = $node->lsn('write');
+$node->wait_for_catchup($node_standby, 'write', $primary_lsn);
+
+$node_standby->stop();
+$node->stop('immediate');
+
+# crash restart the primary
+$logstart = -s $node->logfile;
+$node->start();
+ok($node->wait_for_log($reached_eow_pat, $logstart),
+   'primary properly emits end-of-WAL message');
+
+# restart the standby
+$logstart = -s $node_standby->logfile;
+$node_standby->start();
+ok($node_standby->wait_for_log($reached_eow_pat, $logstart),
+   'standby properly emits end-of-WAL message');
+
+$node_standby->stop();
+$node->stop();
+
+done_testing();
+
+#### helper routines
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+	my ($node, $pat, $off) = @_;
+
+	$off = 0 unless defined $off;
+	my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+	return 0 if (length($log) <= $off);
+
+	$log = substr($log, $off);
+
+	return $log =~ m/$pat/;
+}
diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d..5a17f68512 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -258,16 +258,29 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###########################################################################
 
-# xl_tot_len is 0 (a common case, we hit trailing zeroes).
-emit_message($node, 0);
-$end_lsn = advance_out_of_record_splitting_zone($node);
+# empty record without trailing garbage bytes until the page end - not an error
 $node->stop('immediate');
 my $log_size = -s $node->logfile;
 $node->start;
+ok( $node->log_contains(
+		"LOG:  reached end of WAL at.*\n.* DETAIL:  empty record at",
+		$log_size),
+	"end-of-WAL by empty record");
+
+# xl_tot_len is 0 with following garbage bytes in the page
+emit_message($node, 0);
+$end_lsn = advance_out_of_record_splitting_zone($node);
+$node->stop('immediate');
+write_wal($node, $TLI,
+		  # last byte in the page at $end_lsn
+		  $end_lsn - ($end_lsn % $WAL_BLOCK_SIZE) + $WAL_BLOCK_SIZE - 1,
+		  pack("c", 1)); # garbage byte
+$log_size = -s $node->logfile;
+$node->start;
 ok( $node->log_contains(
 		"invalid record length at .*: expected at least 24, got 0", $log_size
 	),
-	"xl_tot_len zero");
+	"zero xl_tot_len followed by garbage bytes");
 
 # xl_tot_len is < 24 (presumably recycled garbage).
 emit_message($node, 0);
@@ -328,7 +341,7 @@ note "Multi-page end-of-WAL detection, header is not split";
 # This series of tests requires a valid xl_prev set in the record header
 # written to WAL.
 
-# Good xl_prev, we hit zero page next (zero magic).
+# Good xl_prev, we hit zero page next
 emit_message($node, 0);
 $prev_lsn = advance_out_of_record_splitting_zone($node);
 $end_lsn = emit_message($node, 0);
@@ -337,8 +350,24 @@ write_wal($node, $TLI, $end_lsn,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
 $log_size = -s $node->logfile;
 $node->start;
-ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
-	"xlp_magic zero");
+ok( $node->log_contains("WARNING:  empty page in WAL segment .*, offset .* while reading continuation record at .*", $log_size),
+   "empty page");
+
+# Good xl_prev, we hit zero page magic with following garbage bytes.
+emit_message($node, 0);
+$prev_lsn = advance_out_of_record_splitting_zone($node);
+$end_lsn = emit_message($node, 0);
+$node->stop('immediate');
+write_wal($node, $TLI, $end_lsn,
+		  build_record_header(2 * 1024 * 1024 * 1024, 0, $prev_lsn));
+# place garbage at the end of the next page
+write_wal($node, $TLI,
+		  start_of_next_page(start_of_next_page($end_lsn)) - 1,
+		  pack("i", 1));
+$log_size = -s $node->logfile;
+$node->start;
+ok( $node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
+   "bad magic");
 
 # Good xl_prev, we hit garbage page next (bad magic).
 emit_message($node, 0);
@@ -442,8 +471,8 @@ write_wal($node, $TLI, $end_lsn,
 	build_record_header(2 * 1024 * 1024 * 1024, 0, 0xdeadbeef));
 $log_size = -s $node->logfile;
 $node->start;
-ok($node->log_contains("invalid magic number 0000 .* LSN .*", $log_size),
-	"xlp_magic zero (split record header)");
+ok( $node->log_contains("WARNING:  empty page in WAL segment .*, offset .* while reading continuation record at .*", $log_size),
+	"zero page while reading a record (split record header)");
 
 # And we'll also check xlp_pageaddr before any header checks.
 emit_message($node, 0);
-- 
2.39.3